Fundamentals of Linear Algebra for Signal Processing
2022 09 22
©
James P. Reilly
Professor Emeritus
Department of Electrical and Computer Engineering
McMaster University
DRAFT
1 Fundamental Concepts 1
1.0.1 Notation 1
1.1.4 Rank 11
1.5 Determinants 19
1.6 Problems 24
2.9 Problems 80
4.6 Alternate Differentiation of the Quadratic Form 119
6 The QR Decomposition 165
7.2 The Least-Squares Solution 199
8.1 The Pseudo–Inverse 231
10 Regularization 269
Preface
This book is intended for graduate students who require a background in
linear algebra, or for practitioners who are entering a new field which requires
some familiarity with this topic. The book will give the reader familiarity
with the basic linear algebra toolkit that is required in many disciplines of
modern engineering and science, including machine learning, signal processing,
control theory, process control, applied statistics, robotics, etc. Above all,
this is a teaching text, where the emphasis
is placed on understanding and interpretation of the material.
In the first chapter, some fundamental ideas required for the remaining portion
of the book are established. First, we look at some fundamental ideas
of linear algebra such as linear independence, subspaces, rank, nullspace,
range, etc., and how these concepts are interrelated. A review of matrix
multiplication from a more advanced perspective is also presented.
Chapter 4 deals with the quadratic form and its relation to the eigendecomposition.
The multi–variate Gaussian probability density function is discussed,
and the concept of joint confidence regions is presented. In Chapter 5, a
brief introduction to numerical issues encountered when dealing with floating
point number systems is presented. Then Gaussian elimination is discussed
at some length. The Gaussian elimination process is described through a
bigger–block matrix approach, which leads to other useful decompositions,
such as the Cholesky decomposition of a square symmetric positive definite
matrix. The condition number of a matrix, which plays a critical role in
determining a lower bound on the relative error in the solution of a system of
linear equations, is also developed.
Chapter 1
Fundamental Concepts
1.0.1 Notation
Similarly, the notation a ∈ Rm (Cm ), where Rm (Cm ) denotes the Cartesian
product of R (C) taken m times, implies a vector of m elements which are
taken from the set of real (complex) numbers. When referring to a single
vector, we use the term dimension or length to denote the number of elements.
Also, we indicate that a scalar a is from the set of real (complex) numbers
by the notation a ∈ R(C). Thus, an upper case bold character denotes
a matrix, a lower case bold character denotes a vector, and a lower case
non-bold character denotes a scalar.
The Matrix Inverse: The inverse A−1 of a matrix A is defined such that
AA−1 = A−1 A = I. To be “invertible”, A must be square and full rank.
More on this later.
Trace: The trace of a matrix A, denoted as tr(A), is the sum of its diagonal
elements.
where A ∈ Rm×n = [a1 , . . . , an ]. To see this, we can depict the product Ac
in the following form for the 3 × 3 case:
$$
y = \begin{bmatrix} a & d & g \\ b & e & h \\ c & f & i \end{bmatrix}
    \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}. \tag{1.3}
$$
Note in this example, that all elements of the first column of A are multiplied
only by the coefficient c1 = 1, that the entire second column is multiplied
only by c2 = 2, and all elements of the third column of A are multiplied only
by c3 = 3. Therefore (1.2) can be written in the form y = c1 a1 +c2 a2 +c3 a3 ,
which is identical to (1.1). If C were a matrix with e.g., two columns instead
of one, then the resulting Y would be a matrix with two columns. In this
case, the second column of Y would also be a linear combination of the
columns of A, but in this case the coefficients are from the second column
of C.
It is very important that the reader understand the concept that y in (1.2) is
a linear combination of the columns of A, since it is one of the fundamental
ideas of linear algebra. In this vein it helps greatly to visualize an entire
column as being a single entity, rather than treating each element individ-
ually. We present an additional example to illustrate the concept further.
Consider the following depiction of matrix–vector multiplication
$$
y = \underbrace{\begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix}}_{A}
    \underbrace{\begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix}}_{c}
$$
where each coefficient ci interacts only with the corresponding column ai .
Thus, y = $\sum_i c_i a_i$, which is the same result as (1.2).
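As a quick numerical check of this column interpretation, the following Matlab/Octave sketch compares the product Ac with the explicit linear combination of the columns of A; the particular values of A and c are arbitrary choices for illustration.

% Sketch: A*c equals the linear combination of the columns of A
% weighted by the elements of c.
A = [1 4 7; 2 5 8; 3 6 9];      % columns a1, a2, a3
c = [1; 2; 3];
y_product = A*c;
y_combination = c(1)*A(:,1) + c(2)*A(:,2) + c(3)*A(:,3);
disp(norm(y_product - y_combination))   % 0: the two forms are identical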
In a transposed manner, we may also depict the product y T = cT AT for the 3 × 3 example:
$$
y^T = c^T A^T
    = \begin{bmatrix} 1 & 2 & 3 \end{bmatrix}
      \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix}
    = \begin{bmatrix} 1a & 1b & 1c \\ +2d & +2e & +2f \\ +3g & +3h & +3i \end{bmatrix}
    = \sum_{i=1}^{n} c_i a_i^T \tag{1.4}
$$
where aTi is the ith row of AT in this case. With respect to the middle
equation above, note that y T is a 1 × 3 row vector, where the summation
corresponding to each element is represented in a column format for clarity
of presentation. In a transposed manner corresponding to (1.3), note the ith
element of c interacts only with the ith row aTi of AT , i = 1, . . . , 3. Thus,
using similar logic as the column case, we see that the row vector y T in this
case is a linear combination of the rows of AT , whose coefficients are the
elements of cT .
Here it is implied that c takes on the infinite set of all possible values within
Rn , and consequently {y} is the set of all possible linear combinations of
the columns ai . The dimension of S denoted as dim(S) is the number
of independent directions that span the space; e.g., the dimension of the
universe we live in is 3. In the case where n = 2 and if a1 and a2 are
linearly independent (to be defined), then S is a two–dimensional plane
which contains a1 and a2 .
Figure 1.1. A vector set containing two linearly independent vectors. The dimension of
the corresponding vector space S is 2.
If a third vector with a component orthogonal to the plane of the paper were
added to the set, then the resulting vector space
would be the three–dimensional universe. A third example is shown
in Figure 1.2. Here, since none of the vectors a1 . . . , a3 have a component
which is orthogonal to the plane of the paper, all linear combinations of this
vector set, and hence the corresponding vector space, lies in the plane of the
paper. Thus, in this case, dim(S) = 2, even though there are three vectors
in the set.
Figure 1.2. A vector set containing three linearly dependent vectors. The dimension of the
corresponding vector space S is 2, even though there are three vectors in the set. This is
because the vectors all lie in the plane of the paper.
1.1.2 Linear Independence
Example 1
$$
A = [a_1 \; a_2 \; a_3] = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 3 & -1 \\ 0 & 0 & 1 \end{bmatrix} \tag{1.7}
$$
This set is linearly independent. On the other hand, the set
$$
B = [b_1 \; b_2 \; b_3] = \begin{bmatrix} 1 & 2 & -3 \\ 0 & 3 & -3 \\ 1 & 1 & -2 \end{bmatrix} \tag{1.8}
$$
is not. This follows because the third column is a linear combination of the
first two. (−1 times the first column plus −1 times the second equals the
third column). Thus, the coefficients of the vector c in (1.6) which results
in zero are any scalar multiple of (1, 1, 1). We will see later that this vector
defines the null space of B.
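This is easily confirmed numerically. A minimal Matlab/Octave sketch, using B from (1.8):

% Verify that (1,1,1)' lies in the null space of B and that B is rank deficient.
B = [1 2 -3; 0 3 -3; 1 1 -2];
c = [1; 1; 1];
disp(B*c)          % the zero vector
disp(rank(B))      % 2, i.e., B is column rank deficient
disp(null(B))      % unit-norm basis vector, proportional to (1,1,1)'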
Span: The span of a vector set [a1 , . . . , an ], written as span[a1 , . . . , an ], is
the vector space S corresponding to this set; i.e.,
$$
S = \mathrm{span}\,[a_1 , \ldots , a_n] = \left\{ y \in \mathbb{R}^m \;\middle|\; y = \sum_{j=1}^{n} c_j a_j , \; c_j \in \mathbb{R} \right\},
$$
where in this case the coefficients cj are each assumed to take on the infi-
nite range of real numbers. In the above, we have dim(S) ≤ n, where the
equality is satisfied iff the vectors ai are linearly independent. Note that
the argument of span(·) is a vector set.
Note that [ai1 , . . . aik ] is not necessarily a basis for the subspace S. This
set is a basis iff it is a maximally independent set. This idea is discussed
shortly. The set {ai } need not be linearly independent to define the span
or subspace.
For example, the vectors [a1 , a2 ] in Fig. 1.1 define a subspace (the plane of
the paper) which is a subset of the three–dimensional universe R3 .
Thus, we see that R(A) is the vector space consisting of all linear combi-
nations of the columns ai of A, whose coefficients are the elements xi of
x. Therefore, R(A) ≡ span[a1 , . . . , an ]. The distinction between range and
span is that the argument of range is a matrix, whereas we have seen that
the argument of span is a vector set. We have dim[R(A)] ≤ n, where the
equality is satisfied iff the columns are linearly independent. Any vector
y ∈ R(A) is of dimension (length) m.
Example 3:
$$
A = \begin{bmatrix} 1 & 5 & 3 \\ 2 & 4 & 3 \\ 3 & 3 & 3 \end{bmatrix}
\qquad \text{(the last column is the arithmetic average of the first two)}
$$
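A minimal Matlab/Octave check of Example 3: since the third column is the average of the first two, the coefficient vector (1, 1, −2)T produces a zero linear combination of the columns, and A has rank 2.

% Example 3: a1 + a2 - 2*a3 = 0, so rank(A) = 2.
A = [1 5 3; 2 4 3; 3 3 3];
disp(rank(A))                    % 2
disp(A*[1; 1; -2])               % the zero vector
disp(null(A))                    % basis for N(A), proportional to (1,1,-2)'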
Bases
For example, a basis for the plane spanned by the first two coordinate directions of R3 is
e1 = (1, 0, 0)T
e2 = (0, 1, 0)T .
Note that any linearly independent set in span[e1 , e2 ] is also a basis.
Recall that for any pair of vectors x, y ∈ Rm , their dot product, or inner product,
c is defined as $c = \sum_{i=1}^{m} x_i y_i = x^T y$, where $(\cdot)^T$ denotes transpose. Further,
recall that two vectors are orthogonal iff their inner product is zero. Now
suppose we have a subspace S of dimension r corresponding to the vectors
[a1 , . . . , an ], for r ≤ n ≤ m; i.e., the respective matrix A is tall and the
ai ’s are not necessarily linearly independent. With this background, the
orthogonal complement subspace S⊥ of S of dimension m − r is defined as
$$
S_{\perp} = \left\{ y \in \mathbb{R}^m \;\middle|\; y^T x = 0 \text{ for all } x \in S \right\} \tag{1.10}
$$
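Numerically, orthonormal bases for a subspace and for its orthogonal complement can be obtained with the orth and null functions. The sketch below (Matlab/Octave) uses an arbitrarily chosen tall matrix A with r = 2 independent columns, so that m = 4 and m − r = 2:

% Sketch: orthogonal complement of the column space of a tall matrix A.
% Third column = first + second, so r = 2.
A = [1 0 1; 0 1 1; 1 1 2; 2 0 2];
r  = rank(A);                 % r = 2
Q  = orth(A);                 % orthonormal basis for S = R(A),   m x r
Qp = null(A');                % orthonormal basis for S_perp,     m x (m-r)
disp([size(Q,2) size(Qp,2)])  % [2 2] here, since m = 4 and r = 2
disp(norm(Qp'*Q))             % ~0: every vector of S_perp is orthogonal to S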
1.1.4 Rank
Consider the product C = AB of two matrices A ∈ Rm×2 and B ∈ R2×n , depicted by the following diagram:
$$
\underset{m \times n}{C} \;=\; \underset{m \times 2}{A}\;\;
\underbrace{\begin{bmatrix} x & x & \cdots & x \\ x & x & \cdots & x \end{bmatrix}}_{B}
$$
where the symbol x represents the respective element of B. Then, the rank of
C is at most two. To see this, we realize from our discussion on representing
a linear combination of vectors by matrix multiplication, that the ith column
of C is a linear combination of the two columns of A whose coefficients are
the ith column of B. Thus, all columns of C reside in the vector space
R(A). If the columns of A and the rows of B are linearly independent, then
the dimension of this vector space is two, and hence rank(C) = 2. If the
columns of A or the rows of B are linearly dependent and non–zero, then
rank(C) = 1. This example can be extended in an obvious way to matrices
of arbitrary size.
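The following Matlab/Octave sketch illustrates the argument with random matrices: no matter how large m and n are, the product of an m × 2 matrix and a 2 × n matrix has rank at most two.

% rank(C) <= 2 when C = A*B with A (m x 2) and B (2 x n).
m = 6; n = 8;
A = randn(m, 2);
B = randn(2, n);
C = A*B;
disp(rank(C))                 % 2 (with probability one for random A, B)
% Forcing the columns of A to be dependent drops the rank to one:
A(:,2) = 3*A(:,1);
disp(rank(A*B))               % 1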
$$
N(A) = \left\{ x \in \mathbb{R}^n \;\middle|\; Ax = 0 \right\}, \tag{1.11}
$$
where the trivial value x = 0 is normally excluded from the space. From
previous discussions, the product Ax is a linear combination of the columns
ai of A, where the elements xi of x are the corresponding coefficients. Thus,
from (1.11), N (A) is the set of non–zero coefficients of all zero linear combi-
nations of the columns of A. Therefore if N (A) is non-empty, then A must
have linearly dependent columns and thus be column rank deficient.
Any vector in N (A) is of dimension n. Further, any vector in N (A) is
orthogonal to the rows of A, and is thus in the orthogonal complement
subspace of the rows of A.
The four matrix subspaces of concern are: the column space, the row space,
and their respective orthogonal complements. The development of these four
subspaces is closely linked to N (A) and R(A). We assume for this section
that A ∈ Rm×n , r ≤ min(m, n), where r = rank(A).
where columns of A are the rows of AT . From above, we see that N (AT )
is the set of x ∈ Rm which is orthogonal to all columns of A (rows of AT ).
This by definition is the orthogonal complement of R(A). Any vector in
R(A)⊥ is of dimension m.
The Row Space The row space is defined simply as R(AT ), with dimension
r. The row space is the span of the rows of A. Any vector in R(AT ) is of
dimension n.
Thus it is apparent that the set x satisfying the above is N (A). Any vector
in R(AT )⊥ is of dimension n.
We write the product C as
$$
\underset{m \times n}{C} = \underset{m \times k}{A}\;\;\underset{k \times n}{B} \tag{1.13}
$$
Note that the matrix multiplication operation requires the inner dimensions
of A and B to be equal. Such matrices are said to have conformable dimen-
sions. Four interpretations of the matrix multiplication operation follow:
1. Inner-Product Representation
2. Column Representation
3. Row Representation
This is the transpose operation of the column representation above. The ith
row cTi of C can be written as a linear combination of the rows bTj of B,
whose coefficients are given as the ith row of A, i.e.,
$$
c_i^T = \sum_{j=1}^{k} a_{ij}\, b_j^T , \qquad i = 1, \ldots, m.
$$
4. Outer–Product Representation
This is the largest–block representation. Let ai and bTi be the ith column
and row of A and B respectively. Then the product C may also be expressed
as
$$
C = \sum_{i=1}^{k} a_i b_i^T . \tag{1.15}
$$
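A quick numerical confirmation of the outer–product representation (1.15), using small random matrices in Matlab/Octave:

% Verify C = sum_i a_i * b_i', the outer-product form of matrix multiplication.
m = 4; k = 3; n = 5;
A = randn(m, k);  B = randn(k, n);
C = zeros(m, n);
for i = 1:k
    C = C + A(:,i) * B(i,:);      % rank-one update from the ith column/row pair
end
disp(norm(C - A*B))               % ~0 (up to rounding error)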
$$
A = \begin{bmatrix} A_{11} & \cdots & A_{1p} \\ \vdots & \ddots & \vdots \\ A_{q1} & \cdots & A_{qp} \end{bmatrix}
\begin{matrix} m_1 \\ \vdots \\ m_q \end{matrix}
\qquad \text{(block column widths } s_1 , \ldots , s_p\text{)}
$$
$$
B = \begin{bmatrix} B_{11} & \cdots & B_{1r} \\ \vdots & \ddots & \vdots \\ B_{p1} & \cdots & B_{pr} \end{bmatrix}
\begin{matrix} s_1 \\ \vdots \\ s_p \end{matrix}
\qquad \text{(block column widths } n_1 , \ldots , n_r\text{)}
$$
Here m1 , . . . , mq and s1 , . . . , sp denote the number of rows in each block row of A and B respectively, while s1 , . . . , sp and n1 , . . . , nr denote the number of columns in each block column.
Notice that for each term in the above, the number of columns of the kth
A–block is equal to the number of rows in the kth B–block, which is the
dimension sk , k = 1, . . . , p in the above equations. This way, the blocks
have conformable dimensions with regard to matrix multiplication. Also
the number of blocks p in a row of A must equal the number of blocks
in a column of B. Eq. (1.16) can be proved by verifying that for any
element cij , (1.16) performs exactly the same operations as ordinary matrix
multiplication performs in evaluating the same element.
$$
\begin{bmatrix} V_1 & V_2 \end{bmatrix}
\begin{bmatrix} \Lambda & 0 \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix}
= \begin{bmatrix} V_1 & V_2 \end{bmatrix}
\begin{bmatrix} \Lambda V_1^T \\ 0 \end{bmatrix}
= V_1 \Lambda V_1^T .
$$
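Eq. (1.16) may be checked numerically; the Matlab/Octave sketch below assembles the product of two conformably partitioned random matrices block by block (the particular partition sizes are arbitrary choices) and compares the result with the ordinary product.

% Block matrix multiplication: C_{ij} = sum_k A_{ik} * B_{kj}.
% Row blocks of A of heights [2 3]; inner blocks of size [2 2]; column blocks of B of widths [3 1].
A = randn(5, 4);  B = randn(4, 4);
mRows = {1:2, 3:5};   sCols = {1:2, 3:4};   nCols = {1:3, 4};
C = zeros(5, 4);
for i = 1:2
    for j = 1:2
        for k = 1:2
            C(mRows{i}, nCols{j}) = C(mRows{i}, nCols{j}) + ...
                A(mRows{i}, sCols{k}) * B(sCols{k}, nCols{j});
        end
    end
end
disp(norm(C - A*B))     % ~0: block multiplication agrees with the ordinary product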
1.4 Vector Norms
where p can assume any positive value. Below we discuss commonly used
values for p:
p = 1:
$$
\|x\|_1 = \sum_i |x_i|
$$
p = 2:
$$
\|x\|_2 = \left( \sum_i x_i^2 \right)^{\frac{1}{2}} = (x^T x)^{\frac{1}{2}} ,
$$
which is the familiar Euclidean norm. As implied from the above, we have
the important identity $\|x\|_2^2 = x^T x$.
p = ∞:
$$
\|x\|_\infty = \max_i |x_i| ,
$$
which is the element of x with the largest magnitude. This may be shown
in the following way. As p → ∞, the largest term within the round brackets
in (1.17) dominates all others in the summation. Therefore (1.17) may be
written as
"m #1
X p p 1
||x||∞ = lim |x|i → [|xk |p ] p
p→∞
i=1
= |xk |
Note that the p = 2 norm has many useful properties, but is expensive to
compute. The 1– and ∞–norms are easier to compute, but are more difficult
to deal with algebraically. All the p–norms obey all the properties of a vector
norm.
Figure 1.3 shows the locus of points of the set {x | ||x||p = 1} for p = 1, 2, ∞.
We now consider the relation between ||x||1 and ||x||2 for some point x,
(assumed not to be on a coordinate axis, for the sake of argument). Let x
be a point which lies on the ||x||2 = 1 locus. Because the p = 1 locus lies
inside the p = 2 locus, the p = 1 locus must expand outwards (i.e., ||x||1
must assume a larger value), to intersect the p = 2 locus at the point x.
Therefore we have ||x||1 ≥ ||x||2 . The same reasoning can be used to show
the same relation holds for ||x||1 and ||x||2 vs. ||x||∞ . Even though we
have considered only the 2-dimensional case, the same argument is readily
extended to vectors of arbitrary dimension. Therefore we have the following
generalization: for any vector x, we have ||x||1 ≥ ||x||2 ≥ ||x||∞ .
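These relations are easy to verify numerically; a minimal Matlab/Octave sketch using a randomly chosen vector:

% Compare the 1-, 2- and infinity-norms of a random vector.
x = randn(5, 1);
n1   = norm(x, 1);       % sum of absolute values
n2   = norm(x, 2);       % Euclidean norm, sqrt(x'*x)
ninf = norm(x, inf);     % largest magnitude element
disp([n1 n2 ninf])       % observe n1 >= n2 >= ninf
disp(abs(n2^2 - x'*x))   % ~0, the identity ||x||_2^2 = x'*x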
Figure 1.3. Locus of points of the set {x | ||x||p = 1} for various values of p.

1.5 Determinants
In terms of the cofactors cij of A ∈ Rm×m , the determinant may be expressed as
$$
\det(A) = \sum_{j=1}^{m} a_{ij} c_{ij} , \qquad i \in (1 \ldots m), \tag{1.18}
$$
or
$$
\det(A) = \sum_{i=1}^{m} a_{ij} c_{ij} , \qquad j \in (1 \ldots m). \tag{1.19}
$$
Both the above are referred to as the cofactor expansion of the determinant.
Eq. (1.18) is along the ith row of A, whereas (1.19) is along the jth column.
It is indeed interesting to note that both versions above give exactly the
same number, regardless of the value of i or j.
Eqs. (1.18) and (1.19) express the m × m determinant detA in terms of the
cofactors cij of A, which are themselves (m − 1) × (m − 1) determinants.
Thus, m − 1 recursions of (1.18) or (1.19) will finally yield the determinant
of the m × m matrix A.
Properties of Determinants
5. det(A) = $\prod_{i=1}^{m} \lambda_i$ , where λi are the eigen (singular) values of A.
This means the parallelepiped defined by the column or row vectors
of a matrix may be transformed into a regular rectangular solid of the
same m–dimensional volume whose edges have lengths corresponding
to the eigen (singular) values of the matrix.
where the cij are the cofactors of A. According to (1.18) or (1.19), the ith
row ãTi of Ã times the ith column ai is det(A); i.e.,
$$
\tilde{a}_i^T a_i = \det(A). \tag{1.21}
$$
It may also be shown that
$$
\tilde{a}_i^T a_j = 0, \qquad i \neq j. \tag{1.22}
$$
Then, combining (1.21) and (1.22) for i, j ∈ {1, . . . , m} we have the following
interesting property:
ÃA = det(A)I, (1.23)
2 An orthonormal matrix is defined in Chapter 2.
where I is the m × m identity matrix. It then follows from (1.23) that the
inverse A−1 of A is given as $A^{-1} = \frac{1}{\det(A)}\tilde{A}$, provided det(A) ≠ 0.
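The cofactor construction and the identity (1.23) can be checked numerically. In the sketch below (Matlab/Octave, with an arbitrarily chosen nonsingular A), Ã is taken to be the transposed matrix of cofactors, which is the convention consistent with (1.23):

% Verify (1.23): A_tilde*A = det(A)*I, with A_tilde the transposed cofactor matrix.
A = [1 2 1; 0 3 -1; 2 1 4];          % any nonsingular matrix
m = size(A,1);
C = zeros(m);                         % matrix of cofactors c_ij
for i = 1:m
    for j = 1:m
        M = A;  M(i,:) = [];  M(:,j) = [];    % (m-1) x (m-1) minor
        C(i,j) = (-1)^(i+j) * det(M);
    end
end
A_tilde = C';                         % adjugate of A
disp(A_tilde*A - det(A)*eye(m))       % (numerically) the zero matrix
disp(norm(inv(A) - A_tilde/det(A)))   % ~0: the inverse from (1.23)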
1.6 Problems
3. On Avenue 2 Learn for this course you will find a matlab file Ch1Q3.mat
which contains a short matrix A. Determine the orthogonal comple-
ment of the row space, using matlab. What can you infer about the
rank of A from your results? Also find, using matlab, an orthonormal
basis for R(A) as well as for its orthogonal complement subspace.
4. You will also find the .mat file Ch1Q4.mat, which contains a 6 × 3
matrix A1 along with vectors b1 and b2 .
a pair of microphones. The impulse responses f1 and f2 model the
reverberation effect of the room. Under certain conditions, the source
signal x[n] can be recovered without error, and without knowledge of
f1 and f2 , using the concepts developed in this chapter. The sequence
x[n] is of length m and f1 [n] and f2 [n] are FIR sequences of length
n ≪ m. The outputs y1 [n] and y2 [n] are the convolution of x[n] with
f1 [n] and f2 [n] respectively; i.e.,
$$
y_i[n] = \sum_k f_i[k] \, x[n-k], \qquad i \in [1, 2],
$$
6. The speech signal used to generate the sequences y1 [n] and y2 [n] in
Problem 5 is SPF1.mat, which can be found on Avenue. Also available
are the signals y1 [n] and y2 [n] in the file Ch1Q6.mat.
It is possible to recover the source signal x[n] from the observations
y1 [n] and y2 [n] knowing f1 and f2 or equivalently g1 and g2 . If the
sequences f are of length n, then there exist sequences w1 [n], w2 [n] of
length n − 1 that satisfy the expression
impulse function and thus the original speech x[n] is recovered. Find
the sequences w1 [n], w2 [n] and recover the source x[n].
The true impulse responses f1 and f2 may be found in file Ch1Q6.mat
so you can compare your responses with the true values. You can play
the speech file through your computer sound system by issuing the
command “soundsc(vector )” within Matlab.
Figure 1.4. Configuration for blind deconvolution
Chapter 2
We first discuss this subject from the classical mathematical viewpoint, and
then when the requisite background is in place we will apply eigenvalue and
eigenvectors in a signal processing context. We investigate the underlying
ideas of this topic using the matrix A as an example:
$$
A = \begin{bmatrix} 4 & 1 \\ 1 & 4 \end{bmatrix} \tag{2.1}
$$
Figure 2.1. The vectors Ax1 and Ax3 ; note the counter–clockwise (ccw) rotation of Ax1 with respect to x1 .
The product Ax1 , where x1 = [1, 0]T , is shown in Fig. 2.1. Then,
$$
Ax_1 = \begin{bmatrix} 4 \\ 1 \end{bmatrix}. \tag{2.2}
$$
By comparing the vectors x1 and Ax1 we see that the product vector is
scaled and rotated counter–clockwise with respect to x1 .
Now consider the case where x2 = [0, 1]T . Then Ax2 = [1, 4]T . Here, we
note a clockwise rotation of Ax2 with respect to x2 .
We now let x3 = [1, 1]T . Then A x3 = [5, 5]T . Now the product vector
points in the same direction as x3 ; i.e., Ax3 ∈ span(x3 ) and Ax3 = λx3 .
Because of this property, x3 = [1, 1]T is an eigenvector of A. The scale
factor (which in this case is 5) is given the symbol λ and is referred to as an
eigenvalue.
Ax = λx (2.3)
the form Ax = λIx, or
(A − λI)x = 0 (2.4)
where I is the n × n identity matrix. Thus to be an eigenvector, x must lie
in the nullspace of A − λI. We know that a nontrivial solution to (2.4)
exists if and only if N (A − λI) is non–empty, which implies that
It is easily verified that the roots of this polynomial are (5,3), which corre-
spond to the eigenvalues indicated above.
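These values are easily confirmed with the eig function in Matlab/Octave; note also that the returned eigenvectors are orthonormal, a property of symmetric matrices discussed later in this chapter.

% Eigendecomposition of the example matrix (2.1).
A = [4 1; 1 4];
[V, Lambda] = eig(A);
disp(diag(Lambda)')        % the eigenvalues 3 and 5 (ordering may vary)
disp(V)                    % columns proportional to [1 -1]' and [1 1]' (signs may differ)
disp(norm(V'*V - eye(2)))  % ~0: the eigenvectors are orthonormal
disp(norm(A*V - V*Lambda)) % ~0: A*v_i = lambda_i*v_i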
In the case where there are r ≤ n repeated eigenvalues, a linearly
independent set of n eigenvectors still exists (provided rank(A − λI) = n − r).
However, their directions are not unique in this case. In fact, if [v 1 . . . v r ]
are a set of r linearly independent eigenvectors associated with a repeated
eigenvalue, then any vector in span[v 1 . . . v r ] is also an eigenvector. The
proof is left as an exercise.
Proof. Let {v i } and {λi }, i = 1, . . . , n be the eigenvectors and correspond-
ing eigenvalues of A ∈ Rn×n . Choose any i, j ∈ [1, . . . , n], i 6= j. Then
Av i = λi v i (2.8)
and
Av j = λj v j . (2.9)
v Tj Av i = λi v Tj v i (2.10)
v Ti Av j = λj v Ti v j (2.11)
The quantities on the left are equal when A is symmetric. We show this as
follows. Since the left-hand side of (2.10) is a scalar, its transpose is equal
to itself. Therefore, we get v Tj Av i = v Ti AT v j 3 . But, since A is symmetric,
AT = A. Thus, v Tj Av i = v Ti AT v j = v Ti Av j , which was to be shown.
(λi − λj )v Tj v i = 0 (2.12)
Here we have considered only the case where the eigenvalues are distinct.
If an eigenvalue λ̃ is repeated r times, and rank(A − λ̃I) = n − r, then a
mutually orthogonal set of n eigenvectors can still be found.
i.e, for a symmetric matrix, an element aij = aji . A Hermitian symmetric (or just
Hermitian) matrix is relevant only for the complex case, and is one where A = AH , where
superscript H denotes the Hermitian transpose. This means the matrix is transposed and
complex conjugated. Thus for a Hermitian matrix, an element aij = a∗ji .
In this book we generally consider only real matrices. However, when complex matrices
are considered, Hermitian symmetric is implied instead of symmetric.
3 Here, we have used the property that for matrices or vectors A and B of conformable
size, (AB)T = B T AT .
Property 2 The eigenvalues of a (Hermitian) symmetric matrix are real.
Proof: from [5]. (By contradiction): First, we consider the case where A
is real. Let λ be a non–zero complex eigenvalue of a symmetric matrix A.
Then, since the elements of A are real, λ∗ , the complex–conjugate of λ, must
also be an eigenvalue of A, because the roots of the characteristic polynomial
must occur in complex conjugate pairs. Also, if v is a nonzero eigenvector
corresponding to λ, then an eigenvector corresponding to λ∗ must be v∗ , the
complex conjugate of v. But Property 1 requires that the eigenvectors be
orthogonal; therefore, vT v∗ = 0. But vT v∗ = (vH v)∗ , which is by definition
the complex conjugate of the norm of v. But the norm of a vector is a pure
real number; hence, vT v∗ must be greater than zero, since v is by hypothesis
nonzero. We therefore have a contradiction. It follows that the eigenvalues
of a symmetric matrix cannot be complex; i.e., they are real.
While this proof considers only the real symmetric case, it is easily extended
to the case where A is Hermitian symmetric.
4 The trace, denoted tr(·), of a square matrix is the sum of its diagonal elements.
The proof is straightforward, but because it is easier using concepts pre-
sented later in the course, it is not given here.
q Ti q j = δij , (2.13)
where δij is the Kronecker delta, and q i and q j are columns of the orthonor-
mal matrix Q. When i = j, the quantity q Ti q i defines the squared 2–norm
of q i , which has been defined as unity. When i ≠ j, q Ti q j = 0, due to the
orthogonality of the q i . We therefore have
QT Q = I. (2.14)
Eq. (2.14) follows directly from the fact Q has orthonormal columns. It is
not so clear that the quantity QQT should also equal the identity. We can
resolve this question in the following way. Suppose that A and B are any
two square invertible matrices such that AB = I. Then, BAB = B. By
parsing this last expression, we have
(BA) · B = B. (2.15)
||Qx||22 = xT QT Qx = xT x = ||x||22 .
Consider the case where we have a tall matrix U ∈ Rm×n , where m > n,
whose columns are orthonormal. U can be formed by extracting only the
first n columns of an arbitrary orthonormal matrix. (We reserve the term
orthonormal matrix to refer to a complete m × m matrix). Because U has
orthonormal columns, it follows that the quantity U T U = I n×n . However,
it is important to realize that the quantity U U T ≠ I m×m in this case, in
contrast to the situation when m ≤ n. This fact is easily verified, since
rank(U U T ) = n, which is less than m, and so U U T cannot be the identity.
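The contrast between QT Q, QQT , U T U and U U T is easily seen numerically. In the sketch below (Matlab/Octave), the orthonormal matrices are generated from the QR decomposition of a random matrix (the QR decomposition itself is the subject of Chapter 6):

% Square orthonormal Q: both Q'*Q and Q*Q' equal I.
[Q, ~] = qr(randn(5));            % 5 x 5 orthonormal matrix
disp(norm(Q'*Q - eye(5)))         % ~0
disp(norm(Q*Q' - eye(5)))         % ~0
x = randn(5,1);
disp(norm(Q*x) - norm(x))         % ~0: the 2-norm is preserved
% Tall U with orthonormal columns: U'*U = I, but U*U' is NOT the identity.
U = Q(:, 1:3);                    % first n = 3 columns, m = 5 > n
disp(norm(U'*U - eye(3)))         % ~0
disp(rank(U*U'))                  % 3 < 5, so U*U' cannot equal I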
Avi = λi vi , i = 1, . . . , n. (2.16)
AV = VΛ (2.17)
where V = [v1 , v2 , . . . , vn ] (i.e., each column of V is an eigenvector), and
$$
\Lambda = \begin{bmatrix} \lambda_1 & & & 0 \\ & \lambda_2 & & \\ & & \ddots & \\ 0 & & & \lambda_n \end{bmatrix}
= \mathrm{diag}(\lambda_1 \ldots \lambda_n). \tag{2.18}
$$
A = VΛVT . (2.19)
VT AV = Λ.
It is common convention to order the eigenvalues so that λ1 ≥ λ2 ≥ · · · ≥ λn .
The eigenvectors are reordered to correspond with the ordering of the eigen-
values. For notational convenience, we refer to the eigenvector corresponding
to the largest eigenvalue as the “largest eigenvector” or “principal eigenvec-
tor”. The “smallest eigenvector” is then the eigenvector corresponding to
the smallest eigenvalue.
Here we discuss only the case where A is square and symmetric, and we
entertain the possibility that the matrix A may be rank deficient; i.e., we
are given that rank(A) = r ≤ n. We write the eigendecomposition of A as
A = V ΛV T . We partition V and Λ in the following block formats:
$$
V = [\; \underbrace{V_1}_{r} \;\; \underbrace{V_2}_{n-r} \;]
$$
where
V 1 = [v 1 , v 2 , . . . , v r ] ∈ Rn×r
V 2 = [v r+1 , . . . , v n ] ∈ Rn×n−r .
That is, the columns of V 2 are the eigenvectors corresponding to the n − r
smallest eigenvalues. We also have
$$
\Lambda = \begin{bmatrix} \Lambda_1 & 0 \\ 0 & \Lambda_2 \end{bmatrix}
$$
N (A) = {x ∈ Rn | Ax = 0}
Define
$$
c = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = V^T x = \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} x \tag{2.23}
$$
where c1 ∈ Rr and c2 ∈ Rn−r . We rewrite Ax = 0 in the form
$$
\begin{bmatrix} V_1 & V_2 \end{bmatrix}
\begin{bmatrix} \Lambda_1 & 0 \\ 0 & \Lambda_2 \end{bmatrix}
\begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = 0. \tag{2.24}
$$
5 With regard to this discussion on nullspace, the value of r in (2.21) is not necessarily
taken to be rank(A). The value of r being equal to the rank is established in the discussion
on range, to follow.
Note that the expression Ax = 0 can also be written as Ax = 0x,
which implies that A has at least one zero eigenvalue. Thus this expression
can only be satisfied for x ≠ 0 (and therefore also c ≠ 0) iff some of
the diagonal elements of Λ are zero. Since by definition Λ1 contains the
eigenvalues of largest magnitude, we put all the non–zero eigenvalues in
Λ1 and the zeros in Λ2 . Then (2.24) is satisfied for nonzero x if c1 = 0
and c2 6= 0. From (2.23), this implies that x ∈ R(V 2 ). Thus, V 2 is an
orthonormal basis for N (A).
We now turn our attention to range. Recall from (1.9) that R(A) is defined
as
R(A) = {y ∈ Rm | y = Ax, for x ∈ Rn } .
From the fact that Λ2 = 0 and using (2.24), the expression y = Ax becomes
y = V 1 Λ1 c1 . (2.25)
The vector y spans an r–dimensional space as c1 varies throughout its re-
spective universe iff V 1 consists of r columns and Λ1 contains r non–zero
diagonal elements. Since we have defined rank as the dimension of R(A),
V 1 must have exactly r columns and Λ1 must contain r non–zero values.
Further, it is seen that V 1 is an orthonormal basis for R(A).
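The following Matlab/Octave sketch constructs a rank–deficient symmetric matrix (an arbitrary construction, for illustration only) and confirms that the eigenvectors associated with the zero eigenvalues form a basis for N (A), while the remaining eigenvectors span R(A):

% Build a symmetric matrix of rank r = 2 in R^{4x4} and partition its eigenvectors.
n = 4;  r = 2;
B = randn(n, r);
A = B*B';                          % symmetric, rank 2 (with probability one)
[V, D] = eig(A);
[lambda, idx] = sort(abs(diag(D)), 'descend');
V  = V(:, idx);
V1 = V(:, 1:r);                    % eigenvectors of the r nonzero eigenvalues
V2 = V(:, r+1:n);                  % eigenvectors of the (numerically) zero eigenvalues
disp(lambda')                      % two nonzero values, then values ~0
disp(norm(A*V2))                   % ~0: V2 is an orthonormal basis for N(A)
disp(norm(A - V1*(V1'*A)))         % ~0: the columns of A lie in R(V1) = R(A)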
2.2 An Alternate Interpretation of Eigenvectors
Now that we have the requisite background in place, we can present an alter-
native to the classical interpretation of eigenvectors that we have previously
discussed. This alternate interpretation is very useful in the science and en-
gineering contexts, since it sets the stage for principal component analysis,
which is a widely used tool in many applications in signal processing.
Consider the m × n matrix X as above, whose ith row is xTi and where we
have assumed zero mean, as before. Let θi = xTi q, where q is a vector with
unit 2–norm to be determined. We address the question “What is the q so
that the variance E(θi )2 = E(xTi q)2 is maximum when taken over all values
of i?” The quantity θi = xTi q is the projection of the ith observation xi
onto the unit–norm vector q. The problem at hand is therefore equivalent
to determining the direction for which these projections have maximum
variation, on average. For example, with reference to Fig. 2.2, it is apparent
that the direction along the [1 1]T axis corresponds to the solution in this
case.
where λ is referred to as a Lagrange multiplier, whose value is to be deter-
mined. The constrained solution x∗ then satisfies
$$
\frac{dL(x^*)}{dx} = 0.
$$
Applying the Lagrange multiplier method to (2.27), the Lagrangian is given
by
L(q) = q T Rx q + λ(1 − q T q).
As shown in the appendix of this chapter, and from [6]7 , the derivative of
the first term (when the associated matrix is symmetric) is 2Rx q. It is
straightforward to show that with respect to the second term,
$$
\frac{d}{dq}\left(1 - q^T q\right) = -2q.
$$
Setting the derivative of the Lagrangian to zero, we obtain
$$
2R_x q - 2\lambda q = 0
$$
or
$$
R_x q = \lambda q. \tag{2.28}
$$
We therefore have the important result that the stationary points of the
constrained optimization problem (2.27) are given by the eigenvectors of
Rx . Thus, the vector q ∗ onto which the observations should be projected
for maximum variance is given as the principal eigenvector v 1 of the matrix
Rx . This direction coincides with the major axis of the scatterplot ellipse.
The direction which results in minimum variance of θ is the smallest eigen-
vector v n . Each eigenvector aligns itself along one of the major axes of the
corresponding scatterplot ellipse. In the practical case where only a finite
quantity of data is available, we replace the expected value Rx in (2.27)
with its finite sample approximation given by (2.33).
7 There is a link to this document on the course website.
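A small simulation illustrates the result (2.28). In the Matlab/Octave sketch below, correlated two–dimensional, approximately mean–centered samples are generated (the data construction is an arbitrary choice for illustration); the sample variance of the projections is largest along the principal eigenvector of the estimated covariance matrix:

% Projection variance is maximized along the principal eigenvector of Rx.
m  = 1000;
x1 = randn(m,1);
X  = [x1, 0.8*x1 + 0.6*randn(m,1)];     % positively correlated, (approximately) zero-mean rows
Rx = (X'*X)/m;                          % finite-sample covariance estimate
[V, D] = eig(Rx);
[~, imax] = max(diag(D));
q1 = V(:, imax);                        % principal eigenvector
var_principal = var(X*q1);              % variance of the projections onto q1
var_other     = var(X*[1; 0]);          % variance along an arbitrary unit vector
disp([var_principal, var_other, max(diag(D))])
% var_principal approximately equals the largest eigenvalue and exceeds var_other.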
2.3 Covariance and Covariance Matrices
Here we “change gears” for a while and discuss the idea of covariance, which
is a very important topic in any form of statistical analysis and signal pro-
cessing. In Section 2.5, we combine the topics of eigen–analysis and covari-
ance matrices. The definitions of covariances vary somewhat across various
books and articles, but the fundamental idea remains unchanged. We start
the discussion with a review of some fundamental definitions.
We are given two scalar random variables x1 and x2 . Recall that the mean
µ and variance σ 2 of a random variable are defined as
µ = E(x) and
σ 2 = E(x − µ)2 ,
where E is the expectation operator. The covariance σ12 and correlation ρ12
between the variables x1 and x2 are defined respectively as
We often dispense with the hat notation since in most cases the context is
clear. The hats are used only if necessary to avoid ambiguity. The above
equations can be written in a more compact manner if we assemble the available
samples into a vector x ∈ Rn . In this case we can write
$$
\sigma_j^2 = \frac{1}{n} (x_j - \mu_j)^T (x_j - \mu_j), \qquad j \in [1, 2] \tag{2.29}
$$
$$
\sigma_{12} = \frac{1}{n} (x_1 - \mu_1)^T (x_2 - \mu_2). \tag{2.30}
$$
where in the above we define subtraction of a scalar from a vector as the
scalar being subtracted from each element of the vector.
Because the presentation is easier in the case where the means are zero, and
because most of the real–life signals we deal with do have zero mean, from
this point onwards, we assume either that the means of the variables are
zero, or that we have subtracted the means as a pre–processing step – i.e.,
xj ← xj − µj . Variables of this sort are referred to as mean centered data.
The final example is a bit contrived, but it nevertheless illustrates the point.
Here we assume x1 is the maximum speed at which a person can run and
again x2 remains as the person’s corresponding weight. Then generally,
Figure 2.2. Figures 2.2 – 2.4: Scatter plots for different cases of random vectors [x1 x2 ]T
for different values of covariance, for mean–centered data. Fig. 2.2: covariance σ12 =
+0.8. Fig. 2.3: covariance σ12 = 0, and Fig. 2.4: covariance σ12 = −0.8. The axes
are normalized to zero mean and standard deviation = 1. Each point in each figure
represents one observation [x1 , x2 ]T of the random vector x. In each figure there are 1000
observations.
the greater the person’s weight, the slower they run. So in this case, the
variables most often have opposite signs, so the covariance σ12 is negative. A
non-zero covariance between two random variables implies that one variable
affects the other, whereas a zero covariance implies there is no (first–order)
relationship between them.
In the example of Fig. 2.2, the effect of a positive covariance between the
variables is to cause the respective scatterplot to become elongated along
the major axis, which in this case is along the direction [1, 1]T , where the
elements have the same sign. Note that the direction [1, 1]T coincides with
that of the first eigenvector – see Sect. 2.2. In this case, due to the pos-
itive correlation, mean–centered observations where the height and weight
are simultaneously either larger or smaller than the means (i.e., height and
weight have the same sign) are relatively common, and therefore observa-
tions relatively far from the mean along this direction have a relatively high
probability, and so the variance of the observations along this direction is
relatively high. On the other hand, again due to the positive correlation,
mean–centered observations along the direction [1, −1]T (i.e., where height
and weight have opposite signs) that are far from the mean have a lower
probability, with the result the variance in this direction is smaller (i.e., tall
and skinny people occur more rarely than tall and correspondingly heavier
people). In cases where the governing distribution is not Gaussian, similar
behaviour persists, although the scatterplots will not be elliptical in shape.
As a further example, take the limiting case in Fig. 2.2 where σ12 → 1.
Then, the knowledge of one variable completely specifies the other, and the
scatterplot devolves into a straight line. In summary, we see that as the value
of the covariance increases from zero, the scatterplot ellipse transitions from
being circular (when the variances of the variables are equal) to becoming
elliptical, with the eccentricity of the ellipse increasing with covariance until
eventually the ellipse collapses into a line as σ12 → 1.
We recognize the diagonal elements as the variances σ12 . . . σn2 of the elements
x1 . . . xn respectively, and the (i, j)th off–diagonal element as the covariance
σij between xi and xj . Since multiplication is commutative, σij = σji and
so Rx is symmetric. It is also apparent that covariance matrices are square.
It therefore follows that its eigenvectors are orthogonal.
Note that the rows xTi of X in the last term (2.33) are the transpose of the
xi ’s in the middle term. It would perhaps be more straightforward if we
represented X in its transposed version where each xi forms a column of
X. Then X would be an n × m matrix where each column is an observation
and there would be a more direct correspondence between the summation
term in the middle and the outer product term on the right. However the
formulation in (2.33) is the only way to abide by the common conventions
that vectors are columns and that X is formulated so that its rows consist
of observations.
Note that m ≥ n for R̂ to be full rank. The covariance matrices for each of
our three examples discussed above are given as
$$
R_1 = \begin{bmatrix} 1 & +0.8 \\ +0.8 & 1 \end{bmatrix}, \quad
R_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \text{and} \quad
R_3 = \begin{bmatrix} 1 & -0.8 \\ -0.8 & 1 \end{bmatrix}.
$$
Note that 1’s on each diagonal result because the variances of x1 and x2
have been normalized to unity.
The word stationary as used above means the random process is one for
which the corresponding joint n–dimensional probability density function
describing the distribution of the vector sample xT does not change with
time. This means that all moments of the distribution (i.e., quantities such
as the mean, the variance, and all covariances, as well as all other higher–
order statistical characterizations) are invariant with time. Here however,
we deal with a weaker form of stationarity referred to as wide–sense
stationarity (WSS). With these processes, only the first two moments (mean,
variances and covariances) need be invariant with time. Strictly, the idea of
a covariance matrix is only relevant for stationary or WSS processes, since
expectations only have meaning if the underlying process is stationary. How-
ever, we see later that this condition can be relaxed in an approximate sense
Figure 2.5. The received signal x[k] is decomposed into windows of length n. The samples
in the ith window comprise the vector xi , i = 1, 2, . . . m.
Figure 2.6. A sample of a white Gaussian discrete–time process, with mean µ = 1 and
variance σ 2 = 1.
in the case of a slowly–varying non–stationary signal, if the expectation is
replaced with a time average over an interval over which the signal does not
vary significantly.
Taking the expectation over all windows, eq. (2.34) tells us that the element
r(1, 1) of Rx is by definition E(x21 ), which is the variance of the first element
x1 over all possible vector samples xi of the process. But because of station-
arity, r(1, 1) = r(2, 2) = . . . , = r(n, n) which are all equal to σx2 . Thus all
main diagonal elements of Rx are equal to the variance of the process. The
element r(1, 2) = E(x1 x2 ) is the covariance between the first element of xi
and its second element. Taken over all possible windows, we see this quantity
is the covariance of the process and itself delayed by one sample. Because of
stationarity, the elements r(1, 2) = r(2, 3) = . . . = r(n − 1, n) and hence all
elements on the first upper diagonal are equal to the covariance for a time-
lag of one sample. Since multiplication is commutative, r(2, 1) = r(1, 2),
and therefore all elements on the first lower diagonal are also all equal to
this same cross-correlation value. Using similar reasoning, all elements on
the jth upper or lower diagonal are all equal to the covariance value of the
process for a time lag of j samples. A matrix with equal elements along any
diagonal is referred to as Toeplitz.
If we compare the process shown in Fig. 2.5 with that shown in Fig. 2.6, we
see that in the former case the process is relatively slowly varying. Because
we have assumed x[k] to be mean–centered, adjacent samples of the process
in Fig. 2.5 will have the same sign most of the time, and hence E(xi xi+1 )
will be a positive number, coming close to the value E(x2i ). The same can
be said for E(xi xi+2 ), except it is not quite so close to E(x2i ). Thus, we see
that for the process of Fig. 2.5, the diagonals decay fairly slowly away from
the main diagonal value.
However, for the process shown in Fig. 2.6, adjacent samples are uncorre-
lated with each other. This means that adjacent samples are just as likely
to have opposite signs as they are to have the same signs. On average, the
terms with positive values have the same magnitude as those with negative
values. Thus, when the expectations E(xi xi+1 ), E(xi xi+2 ) . . . are taken, the
resulting averages approach zero. In this case, we see the covariance matrix
concentrates around the main diagonal, and becomes equal to σx2 I. We note
that all the eigenvalues of Rx are equal to the value σx2 . Because of this
property, such processes are referred to as “white”, in analogy to white light,
whose spectral components are all of equal magnitude.
When x[k] is stationary, the sequence {r(1, 1), r(1, 2), . . . , r(1, n)} is the au-
tocorrelation function of the process, for lags 0 to n−1. In the Gaussian case,
the process is completely characterized9 by the autocorrelation function. In
fact, it may be shown [7] that the Fourier transform of the autocorrelation
function is the power spectral density of the process. Further discussion on
this aspect of random processes is beyond the scope of this treatment; the
interested reader is referred to the reference.
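The structure described above can be observed directly by estimating Rx from data. The Matlab/Octave sketch below forms the windowed data matrix X as in Fig. 2.5, both for a white process and for a simple moving–average (correlated) process derived from it; the particular filter is an arbitrary choice, made only to induce correlation between adjacent samples:

% Estimate Rx = E(x_i x_i') by averaging over windows of length n (Fig. 2.5).
N = 50000;  n = 4;
w = randn(N, 1);                       % white, zero-mean process
x = filter(ones(3,1)/3, 1, w);         % slowly varying (moving-average) process
m = floor(N/n);
Xw = reshape(w(1:m*n), n, m)';         % m windows of length n, one per row
Xx = reshape(x(1:m*n), n, m)';
Rw = (Xw'*Xw)/m;                       % ~ sigma^2 * I for the white process
Rx = (Xx'*Xx)/m;                       % ~ Toeplitz, with diagonals decaying away from the main one
disp(Rw)
disp(Rx)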
Some Properties of Rx :
Figure: A uniform linear array of sensors (antennas, etc.) with inter–element spacing d, receiving K incident plane waves of wavelength λ, arriving at angles θ1 , . . . , θK measured from the normal to the array.
(MUltiple SIgnal Classification) algorithm [9] for this purpose. A broader
treatment of the array signal processing field is given in [10].
where
S = [s1 . . . sK ]
10 It may be shown that if d ≤ λ/2, then there is a one–to–one relationship between
the electrical angle φ and the corresponding physical angle θ. In fact, φ = (2πd/λ) sin θ. If
d ≤ λ/2, then θ can be inferred from φ.
φk , k = 1, . . . , K are the electrical phase–shift angles corresponding to the
incident signals. The φk are assumed to be distinct.
The MUSIC algorithm requires that K < M . Before we discuss the imple-
mentation of the MUSIC algorithm per se, we analyze the covariance matrix
R of the received signal x:
$$
\begin{aligned}
R &= E(x x^H) \\
  &= S E(a a^H) S^H + \sigma^2 I \\
  &= S A S^H + \sigma^2 I
\end{aligned} \tag{2.37}
$$
where A = E(aaH ). The second line follows because the noise is uncorre-
lated with the signal, thus forcing the cross–terms to be zero. In the last
line of (2.37) we have also used that fact that the covariance matrix of the
noise contribution (second term) is σ 2 I. This follows because the noise is
assumed white. We refer to the first term of (2.37) as Ro , which is the con-
tribution to the covariance matrix due only to the signal component. The
matrix A ∈ CK×K is full rank if the incident signals are not fully correlated.
In this case, Ro ∈ CM ×M is rank K < M . Therefore Ro has K non-zero
eigenvalues and M − K zero eigenvalues.
R o vi = 0
or SASH vi = 0, i = K + 1, . . . , M.
S H V N = 0, (2.38)
From Property 3 of this chapter, we see that if the eigenvalues of Ro are λi , then,
because here we are dealing with expectations and because the noise is white,
those of R are λi + σ 2 . The eigenvectors remain unchanged with the noise
contribution, and therefore (2.39) still holds when noise is present, under
the current assumptions.
and then extract the M − K noise eigenvectors V N , which are those associ-
ated with the smallest M − K eigenvalues of R̂. Because of the finite N and
the presence of noise, (2.39) only holds approximately for the true φo . Thus,
a reasonable estimate of the desired directions of arrival may be obtained
by finding values of the variable φ for which the expression on the left of
(2.39) is small instead of exactly zero. Thus, we determine K estimates φ̂
which locally satisfy
$$
\hat{\phi} = \arg\min_{\phi} \left\| s^H(\phi)\, \hat{V}_N \right\|_2^2 . \tag{2.40}
$$
Equivalently, the MUSIC spectrum is defined as
$$
P(\phi) = \frac{1}{s(\phi)^H \hat{V}_N \hat{V}_N^H s(\phi)} .
$$
It will look something like what is shown in Fig. 2.8, when K = 2 incident
signals. Estimates of the directions of arrival φk are then taken as the peaks
of the MUSIC spectrum.
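To make the procedure concrete, here is a minimal Matlab/Octave sketch of the MUSIC spectrum for a simulated uniform linear array, assuming the model x = Sa + noise of (2.36)–(2.37) and a steering vector of the form s(φ) = [1, e^{jφ}, . . . , e^{j(M−1)φ}]T ; the array size M , the number of sources K, the snapshot count N , the noise level and the electrical angles are arbitrary illustrative choices, not values from the text:

% Minimal MUSIC sketch: M-element uniform linear array, K = 2 incident signals.
M = 8;  K = 2;  N = 500;  sigma = 0.1;
phi_true = [0.6, 1.4];                          % electrical angles (radians)
S = exp(1j*(0:M-1)'*phi_true);                  % M x K steering matrix
A = (randn(K,N) + 1j*randn(K,N))/sqrt(2);       % random signal amplitudes
X = S*A + sigma*(randn(M,N) + 1j*randn(M,N))/sqrt(2);   % received snapshots
R = (X*X')/N;                                   % sample covariance estimate
[V, D] = eig(R);
[~, idx] = sort(real(diag(D)), 'descend');
VN = V(:, idx(K+1:end));                        % noise-subspace eigenvectors
phi = linspace(-pi, pi, 1000);
P = zeros(size(phi));
for i = 1:length(phi)
    s = exp(1j*(0:M-1)'*phi(i));
    P(i) = 1/real(s'*(VN*VN')*s);               % MUSIC spectrum
end
plot(phi, 10*log10(P)); xlabel('\phi (rad)'); ylabel('P(\phi) (dB)');
% The two largest peaks occur near phi = 0.6 and 1.4.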
Figure 2.8. The MUSIC spectrum P (φ), with peaks at the directions of arrival φ1 and φ2 .
Signal and noise subspaces: The MUSIC algorithm opens up some in-
sight into the use of the eigendecomposition that will be of use later on. Let
us define the so–called signal subspace SS as
SS = span [v 1 , . . . , v K ] (2.41)
We have seen earlier in this chapter that the eigenvectors associated with
the non–zero eigenvalues form a basis for R(Ro ). Therefore
R(Ro ) ∈ Ss . (2.44)
Comparing (2.43) and (2.44), we see that S ∈ SS . From (2.36) we see that
any received signal vector x, in the absence of noise, is a linear combination
of the columns of S. Thus, any noise–free signal resides completely in SS .
This is the origin of the term “signal subspace”. Further, in the presence of
noise, provided N → ∞, (2.41) and (2.42) still hold and then any component
of the received signal residing in SN must be entirely due to the noise,
although noise can also reside in the signal subspace. This is the origin of
the term “noise subspace”. In the case where N is finite, the eigenvectors are
only approximations to their ideal values, and (2.41) and (2.42) hold only
approximately. This results in some leakage between the two subspaces and
(2.38) holding only approximately, resulting in some error in the estimates of
φ, as one would expect under these circumstances. We note that the signal
and noise subspaces are orthogonal complement subspaces of each other.
The ideas surrounding this example lead to the ability to de–noise a signal
in some situations, as we see in a subsequent example.
PCA is the second example of how covariance matrices and eigen–analysis can
be applied to real–life problems. The basic idea behind PCA is, given
a set of observations xi ∈ Rn , i = 1, . . . , m, to transform each observa-
tion xi into a new basis so that as much variance is concentrated into as
few coefficients as possible. The motivation for this objective, as we see
shortly, is that it provides the means for data compression, can be used to
denoise a signal and also provides useful features for classification problems
in a machine learning context. In this section, we assume the process x
is slowly varying, so that a significant degree of correlation exists between
consecutive elements of xi . In this case, the n–dimensional scatterplot is a
hyperellipse, ideally with significant variation along only a few of the axes.
PCA is sometimes referred to as the Karhunen Loeve transform.
θ i = V T xi . (2.45)
The motivation for using the eigenvector basis to represent x follows from
Sect. 2.2, where we have seen that the principal coefficients (i.e., the projec-
tions of xi onto the principal eigenvectors) have maximum variance. When
the elements of x are correlated, we have the result that
E(θ1 )2 ≥ E(θ2 )2 ≥ . . . ≥ E(θr )2 ≥ E(θr+1 )2 ≥ . . . ≥ E(θn )2 , (2.46)
where the [θ1 , . . . , θn ]T are the elements of θ i . This phenomenon is the key to
PCA compression. A justification of this behaviour is given in Property 10,
following. In a typical practical situation, the variances of the θ–elements
fall off quite quickly. We therefore determine a value of r such that the
coefficients θr+1 , . . . , θn are deemed negligible and so are neglected, and we
retain only the first r significant values of θ. Thus the n–length vector θ is
represented by a truncated version θ̂ ∈ Rr given by
$$
\hat{\theta} = \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_r \end{bmatrix} .
$$
This set of r coefficients are the principal components. The reconstructed
version x̂ of x is then given as
x̂ = V r θ̂. (2.47)
where V r consists of the first r columns of V . We note that the entire
length-n sequence x can be represented by only r ≤ n coefficients. The
resulting error ε∗r = E||x − x̂||22 can be shown to be the smallest possible
value with respect to the choice of basis (see Property 12, following). Thus
data compression is achieved with low error. Typically in highly correlated
systems, r can be significantly less than n and the compression ratios (i.e.,
r/n) attainable with the PCA method can be substantial.
If we substitute the first r values of θ from (2.45) into (2.47), we have the
result
x̂ = V r V Tr x.
Since x̂ may be viewed as a projection of x into the principal component
subspace, it is apparent from the above that this projection operation is
accomplished by pre–multiplication of x by the matrix V r V Tr . This matrix
is referred to as a projector. Projectors are explored in more detail in Ch.3.
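A compact Matlab/Octave sketch of the PCA operations described above, applied to synthetic correlated data; the data–generation model and the choice r = 2 are arbitrary, for illustration only:

% PCA: project (approximately mean-centered) observations onto the r principal
% eigenvectors, then reconstruct via the projector Vr*Vr'.
m = 2000;  n = 10;  r = 2;
Z = randn(m, 2);                        % two underlying latent factors
X = Z * randn(2, n) + 0.05*randn(m, n); % correlated observations, one per row
Rx = (X'*X)/m;                          % sample covariance estimate
[V, D] = eig(Rx);
[lambda, idx] = sort(diag(D), 'descend');
Vr = V(:, idx(1:r));                    % principal eigenvectors
Theta = X * Vr;                         % principal components, m x r
Xhat  = Theta * Vr';                    % reconstruction, i.e. X * (Vr*Vr')
err   = mean(sum((X - Xhat).^2, 2));    % average squared reconstruction error
disp([err, sum(lambda(r+1:end))])       % the two values agree, as in Property 11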
To prove this, we evaluate the covariance matrix Rθθ of θ, using the defini-
tion (2.45) as follows:
$$
\begin{aligned}
R_{\theta\theta} &= E\left(\theta \theta^T\right) \\
                 &= E\left(V^T x x^T V\right) \\
                 &= V^T R_x V \\
                 &= \Lambda .
\end{aligned} \tag{2.48}
$$
Property 9 The variance of the ith PCA coefficient θi is equal to the ith
eigenvalue λi of Rx .
The proof follows directly from prop. (8) and the fact that the ith diagonal
element of Rθθ is the variance of θi .
From this property, we can infer that the length of the ith semi–axis of
the scatterplot ellipse is directly proportional to the square root of the ith
eigenvalue, which is equal to the variance of θi . The next property shows
that the eigenvalues indeed become smaller, and therefore so do the variances
of the θi , with increasing index i.
Property 10 The variances of the θ coefficients decrease with index as in
(2.46).
Consider the scatterplot of Fig. 2.2, which shows variation of the samples in
the x1 − x2 axis. Due to the positive correlation between x1 and x2 , samples
where these variables have the same sign (after removal of the mean) are
more likely than the case where they have different signs. Therefore samples
farther from the mean along the principal eigenvector direction [1, 1]T have a
higher probability of occurring than those the same distance from the mean
along the axis [1, −1]T , which is the direction of the second eigenvector.
The result of this behaviour is that E(θ1 )2 ≥ E(θ2 )2 , as is evident from the
Figure. Now consider the variation in the x2 − x3 plane, where we assume
n ≥ 3. Because of stationarity, we can assume that x2 and x3 have the same
correlation structure as that of x1 and x2 . We can therefore apply the same
argument as that above to show that E(θ2 )2 ≥ E(θ3 )2 . By continuing to
apply the same argument in all n dimensions, (2.46) is justified.
To obtain further insight into the behavior of the two sets of eigenvalues, we
consider Hadamard’s inequality [11] which may be stated as:
Consider a square matrix A ∈ Rm×m . Then, $\det A \le \prod_{i=1}^{m} a_{ii}$,
with equality if and only if A is diagonal.
From Hadamard's inequality, det Rc < det Rw , and so also from Property 4,
$\prod_{i=1}^{n} \alpha_i < \prod_{i=1}^{n} \beta_i$. Under the constraint $\sum_i \alpha_i = \sum_i \beta_i$ with the βi all equal,
it follows that α1 > αn ; i.e., the eigenvalues of Rc are not equal. (We say the
eigenvalues become disparate). Thus, according to prop.(9), the variances
in the first few PCA coefficients of a correlated process are larger than those
in the later PCA coefficients. In practice, when x[k] is highly correlated, the
variances in the later coefficients become negligible.
Property 11 The mean–squared error ε∗r = Ei ||xi − x̂i ||22 in the PCA
representation x̂i of xi using r components is given by
$$
\epsilon_r^* = E_i \|x_i - \hat{x}_i\|_2^2 = \sum_{i=r+1}^{n} \lambda_i , \tag{2.49}
$$
Proof:
$$
\begin{aligned}
\epsilon_r^* = E_i \|x_i - \hat{x}_i\|_2^2 &= E_i \|V\theta - V\hat{\theta}\|_2^2 \\
&= E \|\theta - \hat{\theta}\|_2^2 \\
&= \sum_{i=r+1}^{n} E(\theta_i)^2 \\
&= \sum_{i=r+1}^{n} \lambda_i ,
\end{aligned} \tag{2.50}
$$
where in the last line we have used prop (9) and the second line follows due
to the fact that the 2–norm is invariant to multiplication by an orthonormal
matrix.
Property 12 The eigenvector basis provides the minimum ε∗r for a given
value of r.
11 We see later in Ch. 4 that Rx is positive definite and so all the eigenvalues are positive.
Thus each truncated term can only increase the error.
Proof: Recall θ = V T x = [V r V 2 ]T x, which we partition as θ = [θ Tr θ T2 ]T , where θ 2 contains the coefficients
which become truncated. Then from our discussions in Sect. 2.2, there
is no other basis for which ||θ r ||22 is greater. Because V is an orthonormal
transformation, ||θ r ||22 + ||θ 2 ||22 = ||x||22 . Since ||x||22 is invariant to the choice
of basis and ||θ r ||22 is maximum, ||θ 2 ||22 = ε∗r must be minimum with respect
to the basis.
Figure 2.9. A sample of a white noise sequence (red) and its corresponding filtered version
(blue). The white noise sequence is a Gaussian random process with µ = 0 and σ 2 = 1,
generated using the matlab command “randn”.
correlated. Vector samples xTi ∈ Rn are extracted from the sequence x[k]
in the manner shown in Fig. 2.5 and assembled into rows of the data ma-
trix X. The filter removes the high-frequency components from the input
and so the resulting output process x[k] must therefore vary more slowly in
time and therefore exhibit a significant covariance structure. As a result,
we expect to be able to accurately represent the original signal using only
a few principal eigenvector components, and be able to achieve significant
compression gains.
The sample covariance matrix R̂x was then computed from X as in (2.35)
for the value n = 10. Listed below are the 10 eigenvalues of R̂x :
Eigenvalues:
0.5468
0.1975
0.1243 × 10−1
0.5112 × 10−3
0.2617 × 10−4
0.1077 × 10−5
0.6437 × 10−7
0.3895 × 10−8
0.2069 × 10−9
0.5761 × 10−11
Inspection of the eigenvalues above indicates that a large part of the total
variance is contained in the first two eigenvalues. We therefore choose r = 2.
The error ε∗r for r = 2 is thus evaluated from the above data as $\sum_{i=3}^{10} \lambda_i = 0.0130$,
which may be compared to the value 0.7573, which is the total
eigenvalue sum. The normalized error is 0.0130/0.7573 = 0.0171. Because this error
may be considered a low enough value, only the first r = 2 components
may be considered significant. In this case, we have a compression gain of
10/2 = 5.
We now present an example showing how the PCA process can be used to
denoise a signal. If the signal of interest has significant correlation structure,
we can take advantage of the fact that the eigenvalues of the covariance
Figure 2.10. First two eigenvector components as functions of time, for Butterworth low-
pass filtered noise example.
Figure 2.11. Original vector samples of x as functions of time (solid), compared with their
reconstruction using only the first two eigenvector components (dotted). Three vector
samples are shown.
matrix are typically concentrated into only a few (i.e., r) significant values,
just as they are in the previous case where we were interested in compression.
Thus, in a manner similar to the compression case, we see that in the present
case the signal component is concentrated into a subspace whose basis is the
eigenvector matrix V r = [v 1 , . . . , v r ]. This subspace is exactly analogous to
the signal subspace associated with the MUSIC algorithm of Sect. 2.5.1. The
noise typically has significant contributions over all n eigen–components,
so if r is appreciably less than n which is the case when the signal has
significant correlation structure, reconstructing the signal using only the
signal subspace components has the effect of suppressing a large part of the
noise. The process we follow for this example is identical to the previous
compression example, except here the effect we are interested in is denoising
rather than compression.
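A sketch of the denoising step in Matlab/Octave: noisy observations are projected onto the estimated signal subspace spanned by V r , formed from the principal eigenvectors of the sample covariance matrix. The Gaussian–pulse signal model, the noise level and the choice r = 3 below are arbitrary stand–ins for the data used in the text:

% PCA denoising: keep only the r principal eigen-components of each observation.
m = 500;  n = 50;  r = 3;
t = (1:n);
clean = exp(-0.5*((t - 25)/4).^2);            % a Gaussian pulse as the signal shape
X0 = (0.8 + 0.4*rand(m,1)) * clean;           % amplitude-varying noise-free pulses
X  = X0 + 0.3*randn(m, n);                    % add white noise
Rhat = (X'*X)/m;
[V, D] = eig(Rhat);
[~, idx] = sort(diag(D), 'descend');
Vr = V(:, idx(1:r));
Xden = X * (Vr*Vr');                          % project each row onto the signal subspace
plot(t, X(1,:), t, Xden(1,:), t, X0(1,:));    % noisy, denoised and clean versions
legend('noisy', 'denoised', 'noise-free');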
Figure 2.12. Example of a Gaussian pulse.
Figure 2.13. 50 superimposed, simulated pulses corrupted by timing jitter, amplitude variation and additive coloured noise.
to the conventional PCA process as discussed. It was empirically determined
that the best value for r in this case is 3.
In the finite data case, the principal eigenvectors of R̂ are only an approxi-
mation to the true signal subspace basis, with the result that there will be
some degree of noise leaking into the estimated signal subspace. Thus the
denoising process we have described in this section is not exact; however, in
most cases in practice the level of noise is suppressed considerably.
Figure 2.14. The first three eigenvector components of the jittered Gaussian pulse.
Figure 2.15. A comparison between the original (jittered) noise–free waveform (dotted,
red), the waveform corrupted by coloured noise, (blue, dash-dot), and the denoised version,
shown in (black, dashed).
Example: Classification Using PCA Coefficients
Figure 2.16. Examples of two low–pass waveforms for the classification example.
Figure 2.17. Examples of two high–pass waveforms for the classification example.
A data matrix X lo was formed from the low–pass data in a manner similar
to that in Sect. 2.6.1. Each row of X lo represents a window of data from the
low–pass data. The matrix X lo consists of m = 200 rows, each of n = 50
samples long. The covariance matrix Rlo ∈ Rn×n was formed in the usual
manner as Rlo = X Tlo X lo . Two principal eigenvectors were then extracted from
Rlo to form the matrix V lo ∈ R50×2 . The same procedure was applied on
the high–pass data to generate X hi and V hi . The two principal eigenvector
waveforms corresponding to the two classes (low–pass and high–pass) are
shown in Figs. 2.18 and 2.19 respectively. It may be observed that the low–
pass eigenvectors shown in Fig. 2.18 vary smoothly from one sample to the
next, as is characteristic of low–pass waveforms; i.e., adjacent samples are
positively correlated, whereas the high–pass eigenvector waveforms tend to
change sign between adjacent samples; i.e., adjacent samples are negatively
correlated in this case.
Figure 2.18. The two principal eigenvector waveforms for the low–pass data.
Figure 2.19. The two principal eigenvector waveforms for the high–pass data.
73
There are several different techniques we can apply to classify a test sample
into one of these two classes. The approach used here is to use the PCA
coefficients formed from the low–pass eigenvectors V lo as features for the
classification process. In this respect, we form two 200 × 2 matrices θ_i, i ∈ {lo, hi}, one for each class, where each row (consisting of 2 elements) is the pair of PCA coefficients obtained using the first two principal eigenvectors V_lo of the low–pass process. The θ–matrices are generated in the following manner:
θ lo = X lo V lo and (2.51)
θ hi = X hi V lo . (2.52)
Note that the low–pass eigenvectors are used in each case. These coefficients
are then used as features for a random forest classifier, which is implemented
using the Matlab Classification Learner toolbox. A scatterplot of the θ lo and
θ hi coefficients, where each 2–dimensional row of θ represents a single point,
is shown in Fig. 2.20. The red and blue points correspond to the low–
pass and high–pass processes respectively. It is seen that the classes separate
very cleanly, with the high–pass points concentrating near the origin, and
the low–pass points scattered throughout the feature space. The overall
training accuracy for this experiment is 98.3%. By inspection of the high–
pass waveforms of Fig. 2.17, we see the samples alternate sign between
adjacent samples, whereas the low–pass eigenvector waveforms of Fig. 2.18
vary smoothly. Eq. (2.52) evaluates the sample covariance between the
high–pass process and the low–pass eigenvectors. It may be observed that
the characteristics of the waveforms involved lead to the covariances in this
case being low in value, and therefore the high–pass points in Fig. 2.20
concentrate near the origin. On the other hand, in (2.51), because both
waveforms are slowly varying and mutually similar, the covariance values
are larger.
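The feature computation in (2.51) and (2.52) amounts to two matrix products. A minimal sketch in Python/NumPy (X_lo and X_hi are assumed to be the 200 × 50 data matrices described above; the names are hypothetical):

import numpy as np

def pca_features(X_lo, X_hi, n_components=2):
    # Covariance of the low-pass class and its principal eigenvectors
    R_lo = X_lo.T @ X_lo
    _, V = np.linalg.eigh(R_lo)              # ascending eigenvalue order
    V_lo = V[:, -n_components:]              # two principal eigenvectors (50 x 2)

    theta_lo = X_lo @ V_lo                   # Eq. (2.51): 200 x 2 feature matrix
    theta_hi = X_hi @ V_lo                   # Eq. (2.52): low-pass eigenvectors used for both
    return theta_lo, theta_hi, V_lo

The resulting rows of theta_lo and theta_hi are the two-dimensional feature points plotted in Fig. 2.20, which may then be passed to any standard classifier.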
Note that we could have conducted the same experiment by using the high–
pass eigenvectors in (2.51) and (2.52) instead of the low–pass eigenvectors,
and the results would be similar, except the low–pass and high–pass samples
in this case would be reversed in role. The reader is invited to explain why
the procedure described in (2.51) and (2.52) for evaluating the features is
similar in many respects to passing samples from both classes through a
low–pass (or high–pass) filter and evaluating the variances at the output.
In this case, the classes would cluster in a similar manner to that shown
in Fig. 2.20. While this is another suitable method for discriminating the
two classes, it has little pedagogical value for the present purposes since it
doesn’t use eigenvectors.
Figure 2.20. The scatterplot of the PCA coefficients θ corresponding to the low–pass eigenvectors, which are used as features for the classifier. The coefficients from a low–pass sample are represented in red, whereas the high–pass samples are in blue. There are 200 samples from each class. [scatterplot of the two PCA coefficients, with correct/incorrect model predictions indicated; image not reproduced]
2.6.2 PCA vs. Wavelet Analysis
One of the practical difficulties in using PCA for compression is that the
eigenvector set V is usually not available at the reconstruction stage in
practical cases when the observed signal is mildly or severely nonstationary, as is the case with speech or video signals. In this case, the covariance matrix estimate R̂_x changes with time; hence so do the eigenvectors.
Provision of the eigenvector set for reconstruction is expensive in terms of
information storage and so is undesirable. Wavelet functions, which can be
regarded as another form of orthonormal basis, can replace the eigenvector
basis in many cases. While not optimal, the wavelet transform still displays
an ability to concentrate coefficients, and so performs reasonably well in
compression situations. The advantage is that the wavelet basis, unlike the
eigenvector basis, is constant and so does not vary with time. Wavelet–based compression is used in practice, for example in the JPEG 2000 image coding standard.
On the other hand, in many instances where denoising is the objective, the
PCA basis may be more effective than wavelets. In these cases where real-
time performance is not required and a large sample of data is available, the
covariance and eigenvector matrices are readily computed and so denoising
with the PCA basis is straightforward to implement. Also, because of the
optimality of the eigenvector basis, a cleaner denoised signal is likely to
result.
Matrix p-Norms: A matrix p-norm is defined in terms of a vector p-norm. The matrix p-norm of an arbitrary matrix A, denoted ||A||_p, is defined as

||A||_p = sup_{x ≠ 0} ||Ax||_p / ||x||_p,    (2.53)

where "sup" means supremum; i.e., the largest value of the argument over all values of x ≠ 0. Since a property of a vector norm is ||cx||_p = |c| ||x||_p for any scalar c, we can choose c in (2.53) so that ||x||_p = 1. Then, an equivalent statement to (2.53) is

||A||_p = max_{||x||_p = 1} ||Ax||_p.    (2.54)
For the specific case where p = 2, for A square and symmetric, it follows from (2.54) and Sect. 2.2 that ||A||_2 = |λ_1|, the largest eigenvalue magnitude. More generally, it is shown in
the next lecture for an arbitrary matrix A that
||A||2 = σ1 (2.55)
where σ1 is the largest singular value of A. This quantity results from the
singular value decomposition, to be discussed in the next chapter.
It may also be shown that

||A||_1 = max_{1≤j≤n} Σ_{i=1}^{m} |a_ij|   (maximum column sum)    (2.56)

and

||A||_∞ = max_{1≤i≤m} Σ_{j=1}^{n} |a_ij|   (maximum row sum).    (2.57)
Frobenius Norm: The Frobenius norm is the 2-norm of the vector ob-
tained by concatenating all the rows (or columns) of the matrix A:
||A||_F = ( Σ_{i=1}^{m} Σ_{j=1}^{n} |a_ij|² )^{1/2}.
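These definitions are easy to check numerically; the following sketch (an illustration, not part of the text) compares them against library routines:

import numpy as np

A = np.array([[1.0, -2.0, 3.0],
              [0.5,  4.0, -1.0]])

sigma = np.linalg.svd(A, compute_uv=False)
print(np.linalg.norm(A, 2), sigma[0])                          # 2-norm = largest singular value
print(np.linalg.norm(A, 1), np.abs(A).sum(axis=0).max())       # maximum column sum
print(np.linalg.norm(A, np.inf), np.abs(A).sum(axis=1).max())  # maximum row sum
print(np.linalg.norm(A, 'fro'), np.sqrt((np.abs(A)**2).sum())) # Frobenius norm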
Properties of Matrix Norms

1. For any x, ||Ax||_p ≤ ||A||_p ||x||_p. This property follows by dividing both sides by ||x||_p and applying (2.53).

2. If Q and Z are orthonormal matrices of appropriate size, then
||QAZ||_2 = ||A||_2
and
||QAZ||_F = ||A||_F.
Thus, we see that the matrix 2–norm and Frobenius norm are invariant to pre– and post–multiplication by an orthonormal matrix.

3. Further,
||A||_F² = tr(A^T A),
where tr(·) denotes the trace of a matrix, which is the sum of its diagonal elements. While we are considering trace, an important property of the trace operator is
12 Z is the set of positive integers, excluding zero.
Appendix
For the problem at hand, we assign a(x) = xT , and b(x) = Ax. Then it
is readily verified that the first term of (2.59) is xT ai , while the second is
aTi x, where aTi is the ith row of A. Combining the results for i = 1, . . . , n
into a vector, we have
df(x)/dx = x^T A + Ax.
To combine these two terms into a more convenient form, we are at lib-
erty to transpose the first term, since the values of the derivatives remain
unchanged. We then end up with the result
df(x)/dx = A^T x + Ax = (A^T + A)x.
In the case when A is symmetric, then
df(x)/dx = 2Ax.
This result is loosely analogous to the scalar case, where the derivative d/dx(ax²) = 2ax, with a ∈ R.
13 This may be proved in an analogous manner to the scalar case.
2.9 Problems
where x[k] = 0 for k > m or k < 0. Using the x-data in file assig2Q5 2019.mat on the website, find f[k] of length n = 10 so that ||y||_2² is minimized, subject to ||f||_2² = 1.
6. On the course website you will find a file assig1 Ch2 Q6 2020.mat,
which contains a matrix X ∈ Rm×n of data corresponding to the
example of Sect. 2.6.1. Each row is a time–jittered Gaussian pulse
corrupted by coloured noise. Here, m = 1000 and n = 100, as per
the example. Using your preferred programming language, produce a
denoised version of the signal represented by the first row of X.
7. On the course website you will find a .mat file assig2Q7 2019.mat. It
contains a matrix X whose columns contain two superimposed Gaus-
sian pulses with additive noise. Using methods discussed in the course,
estimate the position of the peaks of the Gaussian pulses.
2.12. The width (σ) and position (µ) of the pulse are invariant with
i. We form the sample covariance matrix R̂ over the N observations as R̂ = (1/N) Σ_{i=1}^{N} x_i x_i^T.
where k is the time index and 0 < λ < 1 is a parameter that controls the adaptation rate. Explain how the method operates, and what the effect of varying λ is. What happens to the observation x(k_o), where k_o is constant, as time increases?
Chapter 3
In this chapter we learn about one of the most fundamental and important
matrix decompositions of linear algebra: the SVD. It bears some similarity
with the eigendecomposition (ED), but is more general. Usually, the ED is of interest only for symmetric square matrices, but the SVD may be applied
to any matrix. The SVD gives us important information about the rank,
the column and row spaces of the matrix, and leads to very useful solutions
and interpretations of least squares problems. We also discuss the concept
of matrix projectors, and their relationship with the SVD.
Theorem 1 Let A ∈ Rm×n be a rank r matrix (r ≤ p = min(m, n)). Then
A can be decomposed according to the singular value decomposition as
A = U Σ V^T,    (3.1)

where U ∈ R^{m×m} and V ∈ R^{n×n} are orthonormal, and Σ contains the singular values σ_1, . . . , σ_p of A along its main diagonal, ordered so that

σ_1 ≥ σ_2 ≥ σ_3 ≥ . . . ≥ σ_p ≥ 0.
The matrix Σ must be of dimension Rm×n (i.e., the same size as A), to
maintain dimensional consistency of the product in (3.1). It is therefore
padded with appropriately–sized zero blocks to augment it to the required
size.
Since U and V are orthonormal, we may also write (3.1) in the form:
U^T A V = Σ,    (3.2)

where the dimensions of the factors are m × m, m × n, n × n, and m × n respectively.
1 The concept of positive semi–definiteness is discussed in the next chapter. It means all the eigenvalues are greater than or equal to zero.
where Σ̃² = diag[σ_1², . . . , σ_r²]. We now partition V as [V_1  V_2], where V_1 ∈ R^{n×r}. Then (3.3) has the form

[ V_1^T ; V_2^T ] A^T A [ V_1  V_2 ] = [ Σ̃²  0 ; 0  0 ],    (3.4)

where the diagonal blocks on the right are r × r and (n − r) × (n − r) respectively.
Then by equating corresponding blocks in (3.4) we have
V_1^T A^T A V_1 = Σ̃²    (r × r)    (3.5)
V_2^T A^T A V_2 = 0    ((n − r) × (n − r)).    (3.6)
From (3.5), we can write
Σ̃^{-1} V_1^T A^T A V_1 Σ̃^{-1} = I.    (3.7)
Then, we define the matrix U 1 ∈ Rm×r from (3.7) as
U_1 = A V_1 Σ̃^{-1}.    (3.8)
Then, noting that the product of the first three terms in (3.7) is the transpose
of the product of the latter three terms, we have U T1 U 1 = I and it follows
that
U T1 AV 1 = Σ̃. (3.9)
From (3.6) we also have
AV 2 = 0. (3.10)
We now choose a matrix U 2 so that U ∈ Rm×m = [U 1 U 2 ] is orthonormal.
Then from (3.8) and because U 1 ⊥ U 2 , we have
U_2^T U_1 = U_2^T A V_1 Σ̃^{-1} = 0.    (3.11)
Therefore
U T2 AV 1 = 0. (3.12)
Combining (3.9), (3.10) and (3.12), we have
U^T A V = [ U_1^T A V_1   U_1^T A V_2 ; U_2^T A V_1   U_2^T A V_2 ] = [ Σ̃  0 ; 0  0 ],    (3.13)
which was to be shown.
3.1.1 Relationship between SVD and ED
A^T A = [ V_1  V_2 ] [ Σ̃  0 ; 0  0 ] U^T U [ Σ̃  0 ; 0  0 ] [ V_1^T ; V_2^T ]
      = V_1 Σ̃² V_1^T
      = V [ Σ̃²  0 ; 0  0 ] V^T,

which indicates that the eigenvectors of A^T A are the right singular vectors V of A, and the nonzero eigenvalues of A^T A are the squared singular values of A. In an entirely analogous way, AA^T = U [ Σ̃²  0 ; 0  0 ] U^T, which indicates that the eigenvectors of AA^T are the left singular vectors U of A, and the squared singular values of A are the nonzero eigenvalues of AA^T. Notice that in this case, if A is tall and full rank, the eigendecomposition of AA^T will contain m − n additional zero eigenvalues that are not included as singular values of A. If rank(A) = r and if m − r ≥ 2, then there are repeated zero eigenvalues and U_2 is not unique in this case.
The singular values may thus be partitioned into the r non–zero singular values σ_1 ≥ · · · ≥ σ_r > 0 and the p − r zero singular values σ_{r+1} = · · · = σ_p = 0.
We also partition the U and V as before in the previous section. We can
then write the SVD of A in the form
A = [ U_1  U_2 ] [ Σ̃  0 ; 0  0 ] [ V_1^T ; V_2^T ],    (3.15)

where Σ̃ ∈ R^{r×r} = diag(σ_1, . . . , σ_r), and U ∈ R^{m×m} is partitioned as U = [U_1  U_2], with U_1 containing the first r columns and U_2 the remaining m − r columns.
rank(A) = r: From (3.15) we may write A = U_1 Σ̃ V_1^T = U_1 B, where B ≜ Σ̃ V_1^T has full rank r; hence the columns of A are linear
combinations of U 1 ∈ Rm×r . Therefore, since B is full rank, R(A) spans r
dimensions and so rank(A) = r. If r < p = min(m, n), then there are p − r
zero singular values.
N(A) = R(V_2)
R(A) = R(U_1)
R(A^T) = R(V_1)
R(A)^⊥ = R(U_2)
||A||2 = σ1 = σmax
This is straightforward to see from the definition of the 2-norm and the
ellipsoid example to follow in Section 3.1.3.
Inverse of A
A−1 = V Σ−1 U T .
The evaluation of Σ^{-1} is simple because it is square and diagonal. Note that this treatment indicates that the singular values of A^{-1} are [σ_n^{-1}, σ_{n−1}^{-1}, . . . , σ_1^{-1}]
in that order. The only difficulty with this approach is that in general, find-
ing the SVD is more costly in computational terms than finding the inverse
by more conventional means.
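A brief numerical confirmation of these relationships (a sketch; the matrix is arbitrary):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))

U, s, Vt = np.linalg.svd(A)
print(np.allclose(A, U @ np.diag(s) @ Vt))        # A = U Sigma V^T
A_inv = Vt.T @ np.diag(1.0 / s) @ U.T             # A^{-1} = V Sigma^{-1} U^T
print(np.allclose(A_inv, np.linalg.inv(A)))
print(np.allclose(np.linalg.norm(A, 2), s[0]))    # ||A||_2 = sigma_1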
Define

c = [ c_1 ; c_2 ] = [ U_1^T ; U_2^T ] b    (3.17)

and

d = [ d_1 ; d_2 ] = [ V_1^T ; V_2^T ] x,    (3.18)

where c_1 and d_1 each have r elements.
Substituting the above into (3.16), the system of equations becomes
Σd = c. (3.19)
This shows that as long as we choose the correct bases, any system of equa-
tions can become diagonal. This property represents the power of the SVD;
it allows us to transform arbitrary algebraic structures into their simplest
forms.
The above equation reveals several interesting facts about the solution of the
system of equations. First, if m > n (A is tall) and A is full rank, then the
right blocks of zeros in Σ, as well as the quantity d2 , are both empty. In this
case, the system of equations can be satisfied exactly only if c2 = 0. This
implies that U T2 b = 0, or that b ∈ R(U 1 ) = R(A) for an exact solution to
exist. This result makes sense, since in this case the quantity Ax is a linear combination of the columns of A, and therefore the equation Ax = b can only be satisfied if b ∈ R(A).
If m < n (A is short) and full rank, then the bottom blocks of zeros in Σ, as well as c_2 in (3.20), are both empty. In this case we have d_1 = Σ̃^{-1} c_1 from the top row and d_2 arbitrary from the bottom row. We can write these relationships in the form

[ d_1 ; d_2 ] = [ V_1^T x ; V_2^T x ] = [ Σ̃^{-1} c_1 ; d_2 ],
where in the top line we have substituted (3.17) for c1 and (3.18) for d.
Thus we see that the solution consists of a “basic” component, which is the
first term above. This term is closely related to the pseudo–inverse, which
we discuss in some detail in Ch. 8. Since V 2 is a basis for N (A) and d2 is
an arbitrary n − r vector, the second term above contributes an arbitrary
component in the nullspace of A to the solution. Thus x is not unique. It
is straightforward to verify that the quantity AV 2 d2 = 0, so the addition
of the second term does not affect the fact we have an exact solution.
If A is not full rank, then none of the zero blocks in (3.20) are empty. This implies that both scenarios above apply in this case.
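The following sketch (my own illustration) forms the "basic" solution V_1 Σ̃^{-1} c_1 for a short, full–rank A and verifies that adding a nullspace component V_2 d_2 leaves Ax = b satisfied:

import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 5                                  # A is short (m < n), assumed full rank
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

U, s, Vt = np.linalg.svd(A)                  # full SVD
r = m                                        # rank of a full-rank short matrix
V1, V2 = Vt[:r].T, Vt[r:].T                  # bases for R(A^T) and N(A)

x_basic = V1 @ ((U.T @ b) / s)               # d1 = Sigma^{-1} c1, x = V1 d1
d2 = rng.standard_normal(n - r)              # arbitrary nullspace coefficients
x_general = x_basic + V2 @ d2

print(np.allclose(A @ x_basic, b))           # exact solution
print(np.allclose(A @ x_general, b))         # still exact, since A V2 d2 = 0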
AV = U Σ.
Note that since Σ is diagonal, the matrix U Σ on the right has orthogonal columns, whose 2–norms are equal to the corresponding singular values.
We can therefore interpret the matrix V as an orthonormal matrix which
rotates the rows of A so that the result is a matrix with orthogonal columns.
Likewise, we have
U T A = ΣV T .
The matrix ΣV T on the right has orthogonal rows with 2–norm equal to the
corresponding singular value. Thus, the orthonormal matrix U T operates
(rotates) the columns of A to produce a matrix with orthogonal rows.
A = QΛQT → AQ = QΛ,
Aq i = λi q i i = 1, . . . , n. ∗ (3.22)
For the SVD, we have
A = UΣVT → AV = UΣ
or
Av i = σi ui i = 1, . . . , p, ∗ (3.23)
where p = min(m, n). Also, since AT = V ΣU T → AT U = V Σ, we have
AT ui = σi v i i = 1, . . . , p. ∗ (3.24)
Thus, by comparing (3.22), (3.23), and (3.24), we see the singular vectors
and singular values obey a relation which is similar to that which defines
the eigenvectors and eigenvalues. However, we note that in the SVD case,
the fundamental relationship expresses left singular vectors in terms of right singular vectors, and vice-versa, whereas the eigenvectors are expressed in
terms of themselves. These SVD relations are used in Chapter 9 to develop
the partial least squares regression method.
The singular values of A, where A ∈ Rm×n are the lengths of the semi-axes
of the hyperellipsoid E given by:
E = {y | y = Ax, ||x||2 = 1} .
That is, E is the set of points mapped out as x takes on all possible values
such that ||x||2 = 1, as shown in Fig. 3.1. To appreciate this point, we look
at the set of y corresponding to {x | ||x||2 = 1}. We take
y = Ax = U Σ V^T x.    (3.25)
We change bases for both x and y. Define
c = UT y
d = V T x.
Figure 3.1. The ellipsoidal interpretation of the SVD. The locus of points E = {y | y = Ax, ||x||_2 = 1} defines an ellipse. The principal axes of the ellipse are aligned along the left singular vectors u_i, with lengths equal to the corresponding singular values. [sketch mapping the unit circle {x | ||x||_2 = 1}, spanned by v_1 and v_2, onto the ellipse E with semi-axes of lengths σ_1 along span(u_1) and σ_2 along span(u_2); image not reproduced]
Then c = U^T y = Σ V^T x = Σ d, so that c_i = σ_i d_i, and since ||d||_2 = ||V^T x||_2 = ||x||_2 = 1,

Σ_{i=1}^{p} (c_i / σ_i)² = Σ_{i=1}^{p} d_i² = 1.
We see that the set {c} is indeed the canonical form of an ellipse in the basis
U . Thus, the principal axes of the ellipse are aligned along the columns
ui of U , with lengths equal to the corresponding singular value σi . This
interpretation of the SVD is useful later in our study of condition numbers.
A Useful Theorem [1]

Write the SVD of A in the outer–product form A = Σ_{i=1}^{r} σ_i u_i v_i^T, where r ≤ p = min(m, n). Given A ∈ R^{m×n} with rank r, what is the matrix B ∈ R^{m×n} with rank k < r closest to A in the 2–norm? What is this 2–norm distance? This question is answered in the following theorem: if A_k = Σ_{i=1}^{k} σ_i u_i v_i^T, then

min_{rank(B)=k} ||A − B||_2 = ||A − A_k||_2 = σ_{k+1}.

In words, this says the closest rank k < r matrix B to A in the 2–norm sense is given by A_k. A_k is formed from A by truncating the contributions in the outer–product expansion associated with the smallest singular values. This idea may be seen as a generalization of PCA, where here we construct low–rank approximations to matrices instead of vectors.
Proof: We have

||A − A_k||_2 = ||U^T (A − A_k) V||_2 = || diag(0, . . . , 0, σ_{k+1}, . . . , σ_r, 0, . . . , 0) ||_2 = σ_{k+1},

where the first equality follows from the fact that the 2-norm of a matrix is invariant to pre– and post–multiplication by an orthonormal matrix (properties of matrix p-norms, Chapter 2). Further, it may be shown [1] that, for any matrix B ∈ R^{m×n} of rank k < r, ||A − B||_2 ≥ σ_{k+1}, and the result follows.
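A short numerical illustration of the theorem (a sketch; the test matrix is arbitrary):

import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # keep the k largest singular values
print(np.linalg.norm(A - A_k, 2), s[k])          # ||A - A_k||_2 equals sigma_{k+1}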
where the columns of V are the eigenvectors of the covariance matrix R and
θ̂ is the sequence of PCA coefficients truncated to r non-zero coefficients.
Then the covariance matrix R̂ corresponding to x̂i is given as
R̂ = E(x̂x̃T )
T T
= E(Vθ̂ θ̂ VT ) = V E(θ̂ θ̂ )VT
= V Λ̂V T , (3.32)
where ui and v i are the columns of U and V of the true image respectively,
and the σi are the singular values. The true image is represented by k = 512;
2 The proof is left as an exercise.
[Figures: the Lena image reconstructed with k = 1, k = 5, and k = 20 SVD components; images not reproduced.]
Figure 3.7. The geometry of the orthogonal projection operation.
for which y_S = P y is given as

P = Q_1 Q_1^T.    (3.34)

3 The Cholesky decomposition, which we will study in Ch. 5, is one such example.
this, we take

W^T W = C^{-T} X^T X C^{-1} = C^{-T} R C^{-1} = C^{-T} C^T C C^{-1} = I.

P = W W^T = X C^{-1} C^{-T} X^T = X R^{-1} X^T = X (X^T X)^{-1} X^T.    (3.35)
1. R(P) = S
2. P2 = P
3. PT = P
then P is a projection matrix onto S. The above are sufficient conditions for
a projector. This means that while these conditions are enough to specify
a projector, there may be other conditions which also specify a projector.
But since we have now proved the projector is unique, these conditions are
also necessary.
It is readily verified that both definitions for the projector [i.e., P = Q_1 Q_1^T and P = X(X^T X)^{-1} X^T] satisfy the above properties.
y = ys + yc
= P y + yc.
Therefore we have
y − P y = yc
(I − P ) y = y c .
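A small numerical check of these projector properties (a sketch; X is an arbitrary full–rank matrix whose columns span S):

import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((6, 3))              # columns span the subspace S

P = X @ np.linalg.inv(X.T @ X) @ X.T         # P = X (X^T X)^{-1} X^T
Q1, _ = np.linalg.qr(X)                      # orthonormal basis for S
print(np.allclose(P, Q1 @ Q1.T))             # same projector from either definition
print(np.allclose(P @ P, P), np.allclose(P, P.T))   # idempotent and symmetric

y = rng.standard_normal(6)
y_s, y_c = P @ y, (np.eye(6) - P) @ y
print(np.allclose(y, y_s + y_c), abs(y_s @ y_c) < 1e-12)  # orthogonal decomposition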
2. V 2 V T2 is the orthogonal projector onto N (A)
Appendices
Consider two vectors x and y where ||x||2 = ||y||2 = 1, s.t. Ax = σy, where
σ = ||A||2 . The fact that such vectors x and y can exist follows from the
definition of the matrix 2-norm. We define orthonormal matrices U and V
so that x and y form their first columns, as follows:
U = [y, U1 ]
V = [x, V1 ]
U^T A V = [ y^T ; U_1^T ] A [ x  V_1 ]    (3.36)
        = [ σ y^T y    y^T A V_1 ; σ U_1^T y    U_1^T A V_1 ]
        = [ σ    w^T ; 0    B ] ≜ A_1,    (3.37)

with the (2,2) block B of dimension (m − 1) × (n − 1),
where B = UT1 AV1 . The 0 in the (2,1) block above follows from the fact
that U1 ⊥ y, because U is orthonormal.
Now, we post–multiply both sides of (3.37) by the vector [σ ; w] and take 2–norms:

|| A_1 [σ ; w] ||_2² = || [ σ  w^T ; 0  B ] [σ ; w] ||_2² ≥ (σ² + w^T w)².    (3.38)

This follows because the term on the extreme right is only the first element of the vector product of the middle term. But, as we have seen, matrix p-norms obey the following property:

|| A_1 z ||_2 ≤ ||A_1||_2 ||z||_2.    (3.39)

Combining (3.38) and (3.39) with z = [σ ; w], we have

||A_1||_2² || [σ ; w] ||_2² ≥ || A_1 [σ ; w] ||_2² ≥ (σ² + w^T w)².    (3.40)

Note that || [σ ; w] ||_2² = σ² + w^T w. Dividing (3.40) by this quantity, we obtain

||A_1||_2² ≥ σ² + w^T w.    (3.41)
But, we defined σ = ||A||_2. Therefore, the following must hold:

||A_1||_2² = ||U^T A V||_2² = ||A||_2² = σ²,    (3.42)

where the equality on the right follows because the matrix 2-norm is invariant to matrix pre- and post-multiplication by an orthonormal matrix. By comparing (3.41) and (3.42), we have the result w = 0.
The whole process repeats using only the component B, until An becomes
diagonal.
3.4 Problems
(a) Construct the projectors for each of the four fundamental sub-
spaces of A.
(b) Explain how to test whether a vector y is in a specified subspace
S.
(c) Construct a random vector x and project it onto the four sub-
spaces, using the respective projectors from part a, to yield the
result y.
(d) Verify that each y is indeed in the subspace corresponding to the
respective P .
time and memory to calculate. Suggest a more practical approach for
determining V . Also show how to find the first r columns of U , given
V . Using similar ideas, extend your method to the case where A is
large and very short, and we are interested only in U . Also show how
to determine the first r columns of V . Hint: Consider the matrices
AT A and AAT , respectively.
Chapter 4
A square symmetric matrix A ∈ R^{n×n} is positive definite if and only if, for any x ≠ 0,

x^T A x > 0.    (4.1)

The matrix A is positive semi–definite if and only if, for any x ≠ 0, we have

x^T A x ≥ 0,    (4.2)
which, as we see later, includes the possibility that A is rank deficient. The
quantity on the left in (4.1) is referred to as a quadratic form of A. It
may be verified by direct multiplication that the quadratic form can also be
expressed in the form
n X
X n
T
x Ax = aij xi xj . (4.3)
i=1 j=1
have the desired properties that T^T = T, S = −S^T, and A = T + S. Note that the diagonal elements of S must be zero.
Chapters 1 and 2, z is a rotation of x due to the fact V is orthonormal.
Thus we have
x^T A x = z^T Λ z = Σ_{i=1}^{n} λ_i z_i².    (4.6)
Thus (4.6) is greater than zero for arbitrary x if and only if λi > 0, i =
1, . . . , n.
We also see from (4.6) that if equality in the quadratic form is attained (x^T A x = 0 for some x ≠ 0 and corresponding z), then at least one eigenvalue of T must be zero. Hence, if A is symmetric and positive semidefinite but not positive definite, then at least one eigenvalue of A must be zero, which means that A is rank deficient.
The locus of points {x | x^T A x = k}, k > 0, defines a scaled version of the ellipse above. In this case, the length of the ith principal axis is given by the quantity sqrt(k/λ_i).
Theorem 5 A (square) symmetric matrix A can be decomposed into the
form A = BB T if and only if A is positive definite or positive semi–definite.
x^T A x = x^T B B^T x = z^T z ≥ 0,    (4.9)

where z ≜ B^T x.
The fact that A can be decomposed into two symmetric factors in this way
is the fundamental idea behind the Cholesky factorization, which is a major
topic of the following chapter.
Here, we very briefly introduce this topic so we can use this material for
an example of the application of the Cholesky decomposition later in this
course, and also in least-squares analysis to follow shortly. This topic is a
good application of quadratic forms. More detail is provided in several books
[12, 13]. First we consider the uni–variate case of the Gaussian probability
Figure 4.2. A Gaussian probability density function with covariance matrix [2 1; 1 2]. [3-D surface plot of the density; image not reproduced]
We can see that the multi-variate case collapses to the uni-variate case when
the number of variables reduces to one. A plot of p(x) vs. x is shown in
Fig. 4.2, for a mean µ = 0 and covariance matrix Σ = Σ1 defined as
Σ_1 = [ 2  1 ; 1  2 ].    (4.12)
Because the exponent in (4.11) is a quadratic form, the set of points satisfying the equation (1/2)(x − µ)^T Σ^{-1}(x − µ) = k, where k is a constant, is an ellipse. The probability α that an observation falls inside this ellipse is

α = ∫_R p(x) dx,
where R is the interior of the ellipse. Stated another way, an ellipse is the
region in which any observation governed by the probability distribution
(4.11) will fall with a specified probability level α. As k increases, the
ellipse gets larger, and α increases. These ellipses are referred to as joint
confidence regions (JCRs) at probability level α.
The covariance matrix Σ controls the shape of the ellipse. Because the quadratic form in this case involves Σ^{-1}, the length of the ith principal axis is sqrt(2kλ_i) instead of sqrt(2k/λ_i), as it would be if the quadratic form were in Σ. Therefore as the eigenvalues of Σ increase, the size of the JCRs increases (i.e., the variances of the distribution increase) for a given value of k.
Figure 4.3. A Gaussian pdf with larger covariance elements. The covariance matrix is [2 1.9; 1.9 2]. [3-D surface plot of the density; image not reproduced]
Note that in this case, the covariance elements of Σ2 have increased substan-
tially relative to those of Σ1 in Fig. 4.2, although the variances themselves
(the main diagonal elements) have remained unchanged. By examining the
pdf of Figure 4.3, we see that the joint confidence ellipsoid has become
elongated, as expected. (For Σ1 of Fig. 4.2 the eigenvalues are (3, 1), and
for Σ2 of Fig. 4.3, the eigenvalues are (3.9, 0.1)). This elongation results
in the conditional probability p(x1 |x2 ) for Fig. 4.3 having a much smaller
variance (spread) than that for Fig. 4.2; i.e., when the covariances are
larger, knowledge of one variable tells us more about the other. This is how
the probability density function incorporates the information contained in
the covariances between the variables. With regard to Gaussian probabil-
ity density functions, the following concepts: 1) larger correlations between
the variables, 2) larger disparity between the eigenvalues, 3) elongated joint
confidence regions, and 4) lower variances of the conditional probabilities,
are all closely inter–related and are effectively different manifestations of
correlations between the variables.
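The eigenvalues quoted above for Σ_1 and Σ_2 are easily reproduced numerically; the following sketch (an illustration, with the level k chosen arbitrarily) also computes the corresponding principal–axis lengths sqrt(2kλ_i) of the joint confidence ellipses:

import numpy as np

Sigma1 = np.array([[2.0, 1.0], [1.0, 2.0]])
Sigma2 = np.array([[2.0, 1.9], [1.9, 2.0]])

for Sigma in (Sigma1, Sigma2):
    lam, _ = np.linalg.eigh(Sigma)            # eigenvalues in ascending order
    k = 1.0                                   # arbitrary level for the quadratic form
    axis_lengths = np.sqrt(2 * k * lam)       # principal semi-axis lengths of the JCR
    print(lam, axis_lengths)

# Sigma1 has eigenvalues (1, 3); Sigma2 has (0.1, 3.9), giving a much more elongated ellipse.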
4.4 The Rayleigh Quotient
r(x) = x^T A x / (x^T x).    (4.15)

It is easily verified that if x is the ith eigenvector v_i of A (not necessarily normalized to unit norm), then r(x) = λ_i:

r(v_i) = v_i^T A v_i / (v_i^T v_i) = λ_i v_i^T v_i / (v_i^T v_i) = λ_i.    (4.16)
This procedure exhibits cubic convergence to the eigenvector. At conver-
gence, µ is an eigenvalue, and z is the corresponding eigenvector. Therefore
the matrix (A − µI) is singular and z is in its nullspace. The solution z be-
comes extremely large and the system of equations (A−µI)z = x is satisfied
only because of numerical error, since x should normally be 0. Nevertheless,
accurate values of the eigenvalue and eigenvector are obtained.
We start with an initial guess x(0) of the eigenvector. Because the eigenvec-
tors are linearly independent, we can express x(0) in the eigenvector basis
V = [v_1, v_2, . . . , v_n] as

x(0) = Σ_{j=1}^{n} c_j v_j.    (4.17)

Repeatedly multiplying by A, we obtain

x(k) = A^k x(0) = Σ_{j=1}^{n} c_j λ_j^k v_j.    (4.18)

We multiply the above by (λ_1/λ_1)^k = 1 to obtain

x(k) = λ_1^k Σ_{j=1}^{n} c_j (λ_j/λ_1)^k v_j  →  λ_1^k c_1 v_1,    (4.19)
as k becomes large. The result on the right follows because the terms (λ_j/λ_1)^k → 0 for j ≠ 1 as k becomes large, since λ_1 is the eigenvalue of largest magnitude. When (4.19) is satisfied, the method has converged, and we have x(k + 1) = λ_1 x(k). At this point, λ_1 is revealed, and v_1 = x(k + 1).
There is a practical matter remaining, and that is from (4.18) we see that x(k) can become very large or very small as k increases, depending on whether the λ's are greater than or less than 1, leading to floating point over– or underflow. This situation is easily remedied by replacing x(k) with x(k)/||x(k)||_2 at each iteration. Other scaling options are possible.
1. Initialize x(0) to some suitable value. Often setting all the elements to one is a good choice. Initialize k = 0.
2. x(k + 1) = A x(k).
3. Set µ(k + 1) = ||x(k + 1)||_2 and replace x(k + 1) with x(k + 1)/µ(k + 1).
4. If the iteration has converged (x(k + 1) no longer changes appreciably), then
   (a) λ_1 = µ(k + 1)
   (b) v_1 = x(k + 1)
   (c) return.
5. k = k + 1
6. Go to step 2.
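A direct transcription of these steps into Python/NumPy (a sketch; the variable names and the simple convergence test are my own):

import numpy as np

def power_method(A, tol=1e-10, max_iter=1000):
    x = np.ones(A.shape[0])                   # step 1: initialize
    mu_old = 0.0
    for _ in range(max_iter):
        x = A @ x                             # step 2
        mu = np.linalg.norm(x)                # step 3: scale to avoid over/underflow
        x = x / mu
        if abs(mu - mu_old) < tol:            # step 4: converged
            return mu, x                      # dominant eigenvalue (magnitude) and eigenvector
        mu_old = mu
    return mu, x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
lam1, v1 = power_method(A)
print(lam1, np.linalg.eigvalsh(A).max())      # both approximately 4.618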
The power method as described above allows us to specify the first term in
this expansion. We can therefore define a deflated version Adef as
Adef = A − λ1 v 1 v T1 .
One caveat regarding this sequential power method is that if the dimension
of A is large, then small errors due to floating point error etc. in the early
eigen–pair estimates can compound, leaving the eigen–pair estimates of the
later stages inaccurate. If the complete eigendecomposition is desired, then
there are better methods which use the QR decomposition (to be described
later) for finding the complete eigendecomposition.
Appendix
where the first term of (4.21) corresponds to holding i constant at the value
k, and the second corresponds to holding j constant at k. Care must be
taken to include the term x2k akk corresponding to i = j = k only once;
therefore, it is excluded in the first two terms and added in separately. Eq.
(4.21) evaluates to
d/dx_k (x^T A x) = Σ_{j≠k} x_j a_kj + Σ_{i≠k} x_i a_ik + 2 x_k a_kk
               = Σ_j x_j a_kj + Σ_i x_i a_ik
               = [Ax]_k + [A^T x]_k
               = [(A + A^T)x]_k
               = [2T x]_k
Combining these results for k = 1, . . . , n into a vector yields the result that

d/dx (x^T A x) = 2T x.
Consider now the maximization of the quadratic form subject to a norm constraint:

max_{||x||_2² = 1} x^T A x.

Using a Lagrange multiplier λ, we form the Lagrangian

x^T A x + λ(1 − x^T x).

Setting the derivative with respect to x to zero (using d/dx(x^T A x) = 2T x and d/dx(x^T x) = 2x) gives

T x = λx.
Thus, the eigenvectors are stationary points of the quadratic form, and the
x which gives the maximum (or minimum), subject to a norm constraint, is
the maximum (minimum) eigenvector of A.
4.7 Problems
2. i ← i + 1

After convergence, u_1 = y, v_1 = x and σ_1 = sqrt( ||x^{(i+1)}||_2 / ||x^{(i)}||_2 ) (values taken before normalization).
2. Consider the inverse power method for computing the smallest eigen–
pair of a matrix A. Show that convergence can be significantly ac-
celerated by replacing A with A − γI, where γ is an estimate of the
smallest eigenvalue, before inversion of A.
(1/2) x^T A x + b^T x + c = 0,
where x ∈ Rn and A is positive definite.
4. Prove that the diagonal elements of a positive definite matrix must be
positive.
Chapter 5
where
s = sign bit = ±1
f = fractional part of x of length t bits
b = machine base = 2 for binary systems
k = exponent
Note that the operation fl(x)(i.e., conversion from a real number x to its
floating point representation) maps a real number x into a set of discrete
points on the real number line. These points are determined by (5.1). This
mapping has the property that the separation between points is proportional
to |x|. Because the operation fl(x) maps a continuous range of numbers into
a discrete set, there is error associated with the representation fl(x).
In the conversion process, the exponent is adjusted so that the most signif-
icant bit (msb) of the fractional part is 1, and so that the binary point is
immediately to the right of the msb. For example, the binary number
x = .0000100111101011011    (5.2)

is normalized to the form

1.00111101 × 2^{-5}.
Since it is known that the msb of the fractional part is a one, it does not need
to be physically present in the actual floating-point number representation.
This way, we get an extra bit, “for free”. This means the number x in (5.2)
may be represented as
fl(x) = (1.) 00111101 × 2^{-5},

where f = 00111101 and the leading 1 is assumed present.
This form takes only 8 bits instead of 9 to represent fl(x) with the same precision.
The range of possible real numbers which can be mapped into the representation fl(x) is

1.00 . . . 0 × 2^L ≤ |fl(x)| ≤ 1.11 . . . 1 × 2^U    (t bits in the fractional part),

where L and U are the minimum and maximum values of the exponent, respectively. Note that any arithmetic operation which produces a result outside of these bounds results in a floating point overflow or underflow error.
Note that because the leading one in the most significant bit position is
absent, it is now impossible to represent the number zero. Thus, a special
convention is needed. This is usually done by reserving a special value of
the exponent field.
Machine Epsilon u
Since the operation f l(x) maps the set of real numbers into a discrete set,
the quantity f l(x) involves error. The quantity machine epsilon, represented
by the symbol u is the maximum relative error possible in f l(x).
It may be shown that

u = 2^{−t+1}

if the machine chops. By "chopping", we mean the machine constructs the
fractional part of fl(x) by retaining only the most significant t bits, and
truncating the rest. If the machine rounds, then the relative error is one
half that due to chopping; hence
u = 2^{−t}

if the machine rounds. Thus, the number fl(x) may be represented as fl(x) = x(1 + ε), where |ε| ≤ u.
It is also noteworthy that if we perform operations on a sequence of n floating point numbers, then the worst–case error accumulates; i.e., if we evaluate Σ_{i=1}^{n} x_i, then it is possible in the worst case that each of the x_i are subject
to the maximum relative error of the same sign, in which case the maximum
relative error of the sum becomes nu. This result holds for both addition
and subtraction operations. It may also be shown that the same result also
holds (to a first order approximation) for both multiplication and division
operations.
It turns out that in order to perform error analysis on floating-point matrix computations, we need the absolute value notation: for a matrix (or vector) A, |A| denotes the matrix (vector) whose elements are the absolute values of the corresponding elements of A.
5.1.1 Catastrophic Cancellation
frac(A) = 1011011 101
frac(B) = 1011011 001

(the leading r = 7 bits of each fraction are identical),
where frac(·) is the fractional part of the number. Because the numbers
are nearly equal, it may be assumed that their exponents have the same
value. Then, we see that the difference frac(A − B) is (100)2 , which has only
t − r = 3 bits significance. We have lost 7 bits of significance in representing
the difference, which results in a drastic increase in u. Thus the difference
can be in significant error.
Solution:

x = ( −b ± sqrt(b² − 4ac) ) / (2a).    (5.5)
There are obviously serious problems with the accuracy of x_2, which corresponds to the "+" sign in (5.5) above. In this case, since b² >> 4ac, sqrt(b² − 4ac) ≈ b. Hence, we are subtracting two nearly equal numbers when calculating x_2, which results in catastrophic cancellation.
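The effect is easy to demonstrate numerically. In the sketch below (an illustration, not from the text), the root computed directly from (5.5) with the "+" sign loses most of its significant figures, while the algebraically equivalent form 2c/(−b − sqrt(b² − 4ac)) avoids subtracting nearly equal numbers:

import numpy as np

a, b, c = 1.0, 1.0e8, 1.0                    # b^2 >> 4ac
disc = np.sqrt(b * b - 4 * a * c)

x2_naive = (-b + disc) / (2 * a)             # catastrophic cancellation
x2_stable = (2 * c) / (-b - disc)            # equivalent form, no cancellation
print(x2_naive, x2_stable)                   # exact root is approximately -1e-8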
In the case where the vectors are not closely orthogonal, then many of the
products (5.6) all have the same sign and the effect of catastrophic cancel-
lation is suppressed, and so there is little if any reduction in the number of
effective significant bits. In this case, the relative error in the inner product
can be expressed in the form
f l(xT y) − xT y ≤ nu xT y .
(5.7)
We see that by dividing each side by |xT y|, we get exactly the relative
error we would expect when representing a sum of n numbers in floating
point format. However, when the vectors become close to orthogonality, the
number of effective significant bits becomes reduced and so the bound of
(5.7) no longer applies as is. It is shown [1] that in this case,
|fl(x^T y) − x^T y| ≤ nu |x|^T |y| + O(u²),    (5.8)
where the absolute value notation of Sect.5.1 has been applied; i.e., we
consider |x| and |y|, which denotes the absolute value of the elements of the
vectors. The notation O(u2 ), read “order u squared”, indicates the presence
of terms in u2 and higher, which can be ignored due to the fact they may be
considered small in comparison to the first-order term in u. Hence (5.8) tells
us that if |x^T y| ≪ |x|^T |y|, which happens when x is nearly orthogonal to y, then the relative error in fl(x^T y) may be much larger than the anticipated
result, which is that the error is upper bounded by nu. This is due to the
catastrophic cancellation implicitly expressed in the form of (5.6).
Fix: If the partial products are accumulated in a double precision register
(length of fractional part = 2t), little error results. This is because multipli-
cation of two t-digit numbers can be stored exactly in a 2t digit mantissa.
Hence, roundoff only occurs when converting to single precision, and the
result is significant to approximately t bits significance in single precision.
Ax = b
To solve the system, we transform this system into the following upper
triangular system by Gaussian elimination:
[ a_11  a_12  a_13 ; 0  a'_22  a'_23 ; 0  0  a''_33 ] [ x_1 ; x_2 ; x_3 ] = [ b_1 ; b'_2 ; b''_3 ]   →   U x = b'.    (5.9)
is designed to place a zero in the appropriate place below the main diagonal
of A.
for i = n, . . . , 1
    x_i := b_i
    for j = i + 1, . . . , n
        x_i := x_i − u_ij x_j
    end
    x_i := x_i / u_ii
end
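A runnable version of this pseudocode (a sketch in Python/NumPy; the example matrix is my own):

import numpy as np

def back_substitute(U, b):
    """Solve U x = b for upper-triangular U."""
    n = U.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):           # i = n, ..., 1
        x[i] = b[i]
        for j in range(i + 1, n):            # j = i+1, ..., n
            x[i] -= U[i, j] * x[j]
        x[i] /= U[i, i]
    return x

U = np.array([[2.0, -1.0, 0.0], [0.0, -1.0, 1.0], [0.0, 0.0, 3.0]])
b = np.array([1.0, 2.0, 3.0])
print(np.allclose(U @ back_substitute(U, b), b))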
(U + E)x̂ = b′,

where |E| ≤ nu|U| + O(u²), and u is machine epsilon. The above equation
says that x̂ is the exact solution to a perturbed system. We see that all
elements of E are of O(nu), which is exactly the error expected in U due
to floating point error alone, with operations over n floating point numbers.
This is the best that can be done with floating point systems. It is worthy
of note that if elements of E have a larger magnitude, then the error in the
solution can be large, such as in the case with Gaussian elimination without
pivoting, as we see later. However in the case at hand, we can conclude that
back substitution is stable. By a numerically stable algorithm, we mean one
that produces relatively small errors in its output values for small errors in
the input values.
The total number of flops required for Gaussian elimination of a matrix A ∈ R^{n×n} may be shown to be O(2n³/3) (one "flop" is one floating point operation; i.e., a floating point add, subtract, multiply, or divide). It is easily shown that backward substitution requires O(n²) flops. Thus, the number of operations required to solve Ax = b is dominated by the Gaussian elimination process for moderate n.
Suppose we can find lower and upper n × n triangular matrices L (with ones
along the main diagonal), and U respectively such that:
A = LU .
solve Lz = b for z
and then U x = z for x.
Since both systems are triangular, they are easy to solve. The first system
requires only forward elimination; and the second only back-substitution.
Forward elimination is the analogous process to backward substitution, but
since it is performed on a lower triangular system, the unknowns are solved
in ascending order for forward elimination ( i.e., x1 , x2 , . . . , xn ) instead of
descending order (xn , xn−1 , . . . , x1 ) as in backward substitution. Forward
substitution requires an equal number of flops as back substitution and is
just as stable. Thus, once the LU factorization is complete, the solution of
the system is easy: the total number of flops required to solve Ax = b is
2n2 . The details of the computation of the LU factorization and the number
of flops required is discussed later.
3. What is the relationship of LU decomposition, if any, to Gaussian
elimination?

A^{(k−1)} = M_{k−1} · · · M_1 A = [ A_11^{(k−1)}   A_12^{(k−1)} ; 0   A_22^{(k−1)} ],    (5.12)

where the leading block A_11^{(k−1)} is (k − 1) × (k − 1) and upper triangular, and A_22^{(k−1)} is (n − k + 1) × (n − k + 1).
The fact that A_11^{(k−1)} is upper triangular means that the decomposition of (5.12) has already progressed (k − 1) stages, as indicated by the superscript (k − 1). The next stage of Gaussian elimination proceeds one step to make the first column of A_22^{(k−1)} zero below the main diagonal element.
Define

M_k = I − α^{(k)} e_k^T,    (5.13)

where I is the n × n identity matrix, e_k is the kth column of I, and

α^{(k)} ≜ (0, . . . , 0, l_{k+1,k}, . . . , l_{n,k})^T,    (5.14)

where

l_ik = a_ik^{(k−1)} / a_kk^{(k−1)},   i = k + 1, . . . , n.    (5.15)

Note that the terms l_ik above are precisely the multipliers required to introduce the required zeros, as in (5.10).
The pivot element: The quantity a_kk^{(k−1)}, which is the upper left–hand element of A_22^{(k−1)}, is the pivot element for the kth stage. This element plays a strategically significant role in the Gaussian elimination process, due to the fact it appears in the denominator of (5.15). We will see that small pivot values lead to large elements in U and L and therefore have the potential to lead to large errors in the solution x.
Written out explicitly, M_k is the identity matrix with the elements −l_{k+1,k}, . . . , −l_{n,k} appearing in the kth column below the main diagonal:

M_k = [ 1
            ⋱
                1                       (kth row)
                −l_{k+1,k}   1
                ⋮                  ⋱
                −l_{n,k}               1 ].    (5.16)
We can visualize the multiplication A(k) = M k A(k−1) with the aid of Fig.
5.1. We assume the pivot element a_kk^{(k−1)} ≠ 0. It may be verified by inspection that the first k rows of the matrix product A^{(k)} are unchanged
relative to those of A(k−1) , as is the lower left block of zeros. We may
gain appreciation for the operation of the Gauss transform, by considering
the most relevant part, which is in forming the kth column of the product
A(k) below the main diagonal. Here we take the inner product of the jth
row (j = k + 1, . . . , n) of M k with the kth column of A(k−1) . Here, the
−lj,k term of M k multiplies the pivot element of A(k−1) , which according to
(5.15) yields the term −aj,k . Due to the ”one” in the jth diagonal position
of M k , this result is then added to the element aj,k . The result over values
j = k + 1, . . . , n is that the kth column of A(k) is replaced with zeros below
the main diagonal, as desired. These arithmetic operations are identical to
those expressed by (5.10), except now we have been able to describe the process
using matrix multiplication.
M n−1 . . . M 1 A = U (5.17)
where U is the upper triangular matrix resulting from the Gaussian elimination process.

Figure 5.1. Depiction of the multiplication M_k A^{(k−1)}, to advance the Gaussian elimination process from step k − 1 to step k. The dotted lines in the upper–right matrix A^{(k−1)} show how the partitions advance from the (k − 1)th to the kth stage. The multiplication process replaces the ×'s in the first column of A_22^{(k−1)} with zeros, except for the pivot element. [figure not reproduced]

Each M_i is unit lower triangular (ULT) (ULT means lower triangular with ones on the main diagonal), and it is easily verified that the product of ULT matrices is also ULT. Therefore, we define a ULT matrix L^{-1} as

M_{n−1} · · · M_1 = L^{-1}.    (5.18)
From (5.17), we then have L−1 A = U . But since the inverse of a ULT
matrix is also ULT, then
A = LU , (5.19)
which is the product of lower and upper triangular factors as desired. We
have therefore completed the relationship between LU decomposition and
Gaussian elimination. U is simply the upper triangular matrix resulting
from Gaussian elimination, and L is the inverse of the product of the M i ’s.
L = M_1^{-1} · · · M_{n−1}^{-1}.    (5.20)
The structure of M_k^{-1}: We note that

M_k^{-1} = I + α^{(k)} e_k^T.    (5.23)
We may prove this form is indeed the desired inverse, as follows. Using the definition of M_k^{-1} from (5.23), we have

M_k^{-1} M_k = (I + α^{(k)} e_k^T)(I − α^{(k)} e_k^T)
            = I − α^{(k)} e_k^T + α^{(k)} e_k^T − α^{(k)} (e_k^T α^{(k)}) e_k^T    (5.24)
            = I.

From (5.14), α^{(k)} has non-zero elements only for those indices which are greater than k (i.e., below the main diagonal position). The only nonzero element of e_k^T is in the kth position. Therefore, e_k^T α^{(k)} = 0, so the last term in (5.24) vanishes as indicated.
Thus M_k^{-1} is given by (5.23). We therefore see that, by looking at the structure of M_k carefully, we can perform the inversion operation simply by inverting a set of signs!
Structure of L = Π_k M_k^{-1}: From (5.18) we have

L = (M_{n−1} · · · M_1)^{-1}
  = M_1^{-1} · · · M_{n−1}^{-1}
  = Π_{i=1}^{n−1} (I + α^{(i)} e_i^T),    (5.25)

where the last line follows from (5.23). Eq. (5.25) may be expressed as

L = I + Σ_{k=1}^{n−1} α^{(k)} e_k^T + cross–products of the form α^{(i)} e_i^T α^{(j)} e_j^T.    (5.26)

Using similar reasoning to that used in (5.24), it may be shown that the cross–product terms in (5.26) are all zero. Therefore

L = I + Σ_{i=1}^{n−1} α^{(i)} e_i^T.    (5.27)
Each term α(k) eTk in (5.27) is a square matrix of zeros except below the main
diagonal of the k th column. Thus the addition operation in (5.27) in effect
inserts the elements of α(k) in the kth column below the main diagonal of L,
for k = 1, . . . n − 1, without performing any explicit arithmetic operations.
The addition of I in (5.27) puts 1’s on the main diagonal to complete the
formulation of L.
As an example, we note from (5.15) that L has the following structure, for n = 4:

L = [ 1 ;
      a_21^(0)/a_11^(0)    1 ;
      a_31^(0)/a_11^(0)    a_32^(1)/a_22^(1)    1 ;
      a_41^(0)/a_11^(0)    a_42^(1)/a_22^(1)    a_43^(2)/a_33^(2)    1 ].
Example 1:
Let

A = [ 2  −1  0 ;  2  −2  1 ;  −2  −1  5 ].

By inspection,

M_1 = [ 1  0  0 ;  −1  1  0 ;  1  0  1 ] = I − α^{(1)} e_1^T,

and

M_1 A = [ 2  −1  0 ;  0  −1  1 ;  0  −2  5 ] = A^{(2)}.

Thus,

M_2 = [ 1  0  0 ;  0  1  0 ;  0  −2  1 ]

and

M_2 A^{(2)} = [ 2  −1  0 ;  0  −1  1 ;  0  0  3 ] = U.

What is L?

L = M_1^{-1} M_2^{-1} = I + Σ_{i=1}^{2} α^{(i)} e_i^T.

Thus,

L = [ 1  0  0 ;  1  1  0 ;  −1  2  1 ],

where the entries below the main diagonal in the first and second columns are the elements of α^{(1)} and α^{(2)} respectively.
Note that LU does indeed equal A.
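The arithmetic of Example 1 is easily checked numerically; a sketch:

import numpy as np

A = np.array([[ 2.0, -1.0, 0.0],
              [ 2.0, -2.0, 1.0],
              [-2.0, -1.0, 5.0]])
L = np.array([[ 1.0, 0.0, 0.0],
              [ 1.0, 1.0, 0.0],
              [-1.0, 2.0, 1.0]])
U = np.array([[ 2.0, -1.0, 0.0],
              [ 0.0, -1.0, 1.0],
              [ 0.0,  0.0, 3.0]])
print(np.allclose(L @ U, A))                 # True: LU = A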
(A + E)x̂ = b,
where
|E| ≤ nu [3|A| + 5|L||U |] + O(u2 ). (5.28)
This analysis, as in the back substitution case, shows that x̂ exactly satisfies
a perturbed system. The question is whether the perturbation |E| is always
small. If |E| is of the order induced by floating point representation alone
(i.e., O(nu)), we may conclude that Gaussian elimination yields a solution
which is as accurate as possible in the face of floating point error. But unlike
the back substitution case, further inspection reveals that (5.28) does not
allow such an optimistic outlook. It may happen during the course of the
Gaussian elimination procedure that the term |L||U | may become large, if
small pivot elements are encountered, causing |E| to become large, as we
consider in the following:
By referring to (5.16), we can see that if any pivot a_kk^{(k−1)} is small in magnitude, then the kth column of M_k is large in magnitude. Because M_k premultiplies A^{(k−1)}, large elements in M_k will result in large elements in the block A_22^{(k)} of (5.12). The result is that both U and L will have large elements as k varies over its range from 1, . . . , n − 1. Hence, |E| in (5.28) is "large", resulting in an inaccurate solution.
The fact that large | L | and | U | lead to an unstable solution can also be
explained in a different way as follows. Consider two different LU decom-
positions on the same matrix A:
1. A = LU (large pivots)
2. A = ΛR (small pivots)
Let us assume that the pivots in the second case are small enough so that
aij = P + N (5.32)
where P (N) is the sum of all terms in (5.30) which are positive (negative). Using arguments similar to those surrounding (5.8), we see that (5.31) implies that both |P|, |N| ≫ |a_ij|. Thus when the pivots are sufficiently small
in magnitude, from (5.32) we see that two nearly equal numbers are being
subtracted, which leads to catastrophic cancellation, and ensuing numerical
instability.
Thus, for stability, large pivots are required. Otherwise, even well-conditioned
systems can have large error in the solution, when computed using Gaussian
elimination.
As an example, consider the matrix

A = [ −0.2725  −2.0518   0.5080   1.1275 ;
       1.0984  −0.3538   0.2820   0.3502 ;
      −0.2779  −0.8236   0.0335  −0.2991 ;
       0.7015  −1.5771  −1.3337   0.0229 ],
which has been designed so that a single pivot element of very small mag-
nitude on the order of 10−13 appears in the (2, 2) position after the first
stage of Gaussian elimination. The L and U matrices which result after
completing the Gaussian elimination process without pivoting contain el-
ements with very large magnitude, on the order of 1012 . The computed
solution x obtained using the LU decomposition without pivoting, for b =
[−0.6888, 10.0022, −1.3670, −2.1863]T is given as
x = [ 1.0826, 0.9882, 1.0622, 0.9704 ]^T,
whereas the true solution is [1, 1, 1, 1]T . The relative error in this computed
solution is 0.1082, which may be regarded as significant, depending on the
application. On the other hand, the solution obtained using the Matlab lin-
ear equation solver, which does use pivoting, yields the true solution within a relative error of approximately u, which is 2.2204 × 10^{−16} on the Matlab platform used to obtain these results.
5.3.1 Pivoting
Full pivoting, where both row and column interchanges are performed, is
stable yet expensive, since arithmetic comparisons are almost as costly as
flops, and many comparisons are required to search through the entire A22
block at each stage to search for the element with the largest magnitude.
Note that both row and column permutations take place to swap the re-
spective element into the pivot position. The number of comparisons can
be drastically reduced if only row permutations take place. That is, the
element with the largest magnitude in the leading column of the A22 block
is permuted into the pivot position using only row interchanges. The result,
which is known as partial pivoting, is almost as stable.
Note that the row– and column–interchange operations will destroy the integrity of the original system of equations Ax = b. In effect, the matrix A has been replaced by the quantity P AΠ, where P = P_{n−1} · · · P_1 and Π = Π_1 · · · Π_{n−1}, where P_i (Π_i) is the respective row (column) permutation matrix at the ith stage of the decomposition. Therefore, a system of equations which is equivalent to the original can be written
in the form

(P A Π)(Π^T x) = P b,

where we have made use of the fact that Π Π^T = I. Thus, for every row
interchange we also exchange corresponding elements of b, and for every
column interchange we exchange corresponding elements of x.
U = DM T
Now consider the case where A is positive definite. Define the symmetric
part T and the asymmetric part S of A respectively as:
T = (A + A^T)/2,    S = (A − A^T)/2.
It is shown [1] that the computed solution x̂ to a positive definite system of
equations satisfies
(A + E)x̂ = b
where

||E||_F ≤ u ( 3n ||A||_F + 5cn² ( ||T||_2 + ||S T^{-1} S||_2 ) ) + O(u²).    (5.33)
Because A is symmetric, then A = LDL^T. Because the d_ii are positive, we may take G = L · diag(√d_11, . . . , √d_nn). Then GG^T = A as desired.
Cholesky decomposition requires fewer flops than regular LU decomposition,
since a properly–designed algorithm can take advantage of the fact the two
factors are transposes of each other. Further, the factorization does not
require pivoting. Both these points result in significantly reduced execution
times.
Also,

g_i1 = a_i1 / g_11,   i = 2, . . . , n.

Thus, all elements in the first column of G can be solved. Now, consider the second column. First, we solve for g_22:

g_21² + g_22² = a_22.

Thus,

g_22 = ( a_22 − g_21² )^{1/2},

where the term in the round brackets is positive if A is positive definite.
Once g22 is determined, all remaining elements in the second column may
be found by comparison with the corresponding elements in the second column of A. The third and remaining columns are solved in a similar way. If
the process works its way in turn through columns 1, . . . , n, each element
in G is found by solving a single equation in one unknown. Determining
each diagonal element involves finding a square root of a particular quantity.
This quantity is always positive if A is positive definite.
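The column–by–column procedure just described can be coded directly; the following is a sketch in Python/NumPy (the test matrix is my own), and it agrees with the library routine np.linalg.cholesky, which returns the same lower–triangular factor:

import numpy as np

def cholesky(A):
    """Return lower-triangular G with G G^T = A, for symmetric positive definite A."""
    n = A.shape[0]
    G = np.zeros_like(A, dtype=float)
    for j in range(n):                                    # work through the columns in turn
        G[j, j] = np.sqrt(A[j, j] - G[j, :j] @ G[j, :j])  # diagonal element (square root step)
        for i in range(j + 1, n):
            G[i, j] = (A[i, j] - G[i, :j] @ G[j, :j]) / G[j, j]
    return G

A = np.array([[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]])
G = cholesky(A)
print(np.allclose(G @ G.T, A), np.allclose(G, np.linalg.cholesky(A)))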
The vector process x has the desired covariance matrix because
E(xxT ) = E(GwwT GT )
= GE(wwT )GT
= GGT
= Σ.
xi = S(θ)ai + ni (5.34)
We note that as a consequence of this whitening process, the signal compo-
nent has also been transformed by G−1 . Therefore, in the specific case of
the MUSIC algorithm, we must therefore substitute G−1 S for S to achieve
correct results.
Note that in both the above cases, any square root matrix B such that
BB T = Σ will achieve the same effects. However, the Cholesky factor is
typically the easiest one to compute.
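A sketch of the colouring operation x = Gw described above, using the Cholesky factor of the covariance matrix from Fig. 4.2 (the variable names are my own; w is assumed zero–mean, unit–variance white noise):

import numpy as np

rng = np.random.default_rng(5)
Sigma = np.array([[2.0, 1.0], [1.0, 2.0]])
G = np.linalg.cholesky(Sigma)                 # G G^T = Sigma

N = 100_000
w = rng.standard_normal((2, N))               # white, unit-variance samples
x = G @ w                                     # coloured samples with covariance Sigma
print(x @ x.T / N)                            # sample covariance, approximately Sigma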
But the capability to quantify error does not address the complete problem.
What we also need to know is how sensitive is the solution x to error in the
quantities A and b. In this respect, in this section, we develop the idea of
the matrix condition number κ(A) of a matrix A.
Ax = b (5.37)
A = U ΣV T . (5.38)
Let us now consider a perturbed version Ã (following the method of (5.28) or (5.33)), where Ã = A + E, and as before E is an error matrix and ε controls the magnitude of the error. In this example let E be taken as the outer product E = ε u_n v_n^T. Then, the singular value decomposition of Ã is identical to that for A, except the perturbed σ_n, denoted σ̃_n, is given by σ_n + ε.
x = V Σ−1 U T b,
or, using the outer product representation for matrix multiplication we have
x = Σ_{i=1}^{n} ( u_i^T b / σ_i ) v_i.    (5.39)
These examples indicate that a small σn can cause large errors in x. But
we don’t have a precise idea of what “small” means in this context. “Small”
relative to what? The following section addresses this question.
Consider the perturbed system where there are errors in both A and b. Here, the notation is simpler if we denote the errors as δA and δb respectively. The perturbed system becomes
We can write the above as

Ax = b   ⟹   x = A^{-1} b    (5.40)
A δx = δb   ⟹   δx = A^{-1} δb.    (5.41)
We now consider the worst possible relative error ||δx||/||x|| in the solution x in the 2–norm sense. This occurs when the direction of δb from (5.41) is such that ||δx||_2 is maximum, and simultaneously, when b from (5.40) is in the direction such that the corresponding ||x||_2 is minimum.
Note the largest singular value of A^{-1} is 1/σ_n, and the smallest is 1/σ_1. Likewise, the u–vector associated with the largest singular value is u_n, and u_1 is associated with the smallest singular value. With this in mind, it is
straightforward to show from the ellipsoidal interpretation of the SVD of A^{-1}, as shown in Fig. 1, Sect. 3.5, that the maximum of ||δx||_2 = ||A^{-1} δb||_2 with respect to δb, for ||δb||_2 held constant, occurs when δb aligns with the vector u_n; i.e.,

max_{δb} ||A^{-1} δb||_2 = max_{δb} ||V Σ^{-1} U^T δb||_2
                        = max_{δb} ||Σ^{-1} U^T δb||_2
                        = (1/σ_n) ||δb||_2.    (5.42)
In the second line, we have used the fact that the 2–norm is invariant to
multiplication by the orthonormal matrix V . The third line follows from
the fact that the maximum occurs when δb = ||δb||2 un , and so from the
orthonormality of U , the quantity U T δb is a vector of zeros except for the
first element, which is equal to ||δb||2 .
Using analogous logic, we see that the minimum of ||A−1 b||2 in (5.40) for
fixed ||b||2 occurs when the direction of b aligns with u1 . In this case,
following the same process as in (5.42), except replacing the maximum with
minimum, we have
min_b ||A^{-1} b||_2 = (1/σ_1) ||b||_2.    (5.43)
We can now use (5.43) and (5.42) as worst–case values in (5.40) and (5.41) respectively to evaluate the worst–case upper bound on the relative error ||δx||_2/||x||_2 in x. We have

||δx||_2 / ||x||_2 ≤ (σ_1/σ_n) ( ||δb||_2 / ||b||_2 ).    (5.44)

The quantity ||δb||_2/||b||_2 in (5.44) may be interpreted as the relative error in A and b. This relative error is magnified by the factor σ_1/σ_n to give the relative error in the solution x. The ratio σ_1/σ_n is an important quantity in matrix analysis and is referred to as the condition number of the matrix A, and is given the symbol κ_2(A). The subscript 2 refers to the 2–norm used in the derivation in this case. In fact, the condition number may be derived using any suitable norm, as discussed in the Appendix of this chapter.
The analysis for this section gives an interpretation of the meaning of the
condition number κ2 (A). It also indicates in what directions b and δb must
point to result in the maximum relative error in x. We see for worst error
performance, δb points along the direction of un , and b points along u1 .
If the “SVD ellipsoid” is elongated, then there is a large disparity in the
relative growth factors in δx and x, and large relative error in x can result.
1. κ(A) ≥ 1.
2. If κ(A) ∼ 1, we say the system is well-conditioned, and the error in
the solution is of the same magnitude as that of A and b.
3. If κ(A) is large, then the system is poorly conditioned, and small errors
in A or b could result in large errors in x. In the practical case, the
errors can be treated as random variables and hence are likely to have
components along all the vectors ui , including un . Thus in a practical
situation with poor conditioning, error growth in the solution is almost
certain to occur.
We still must consider how bad the condition number can be before it starts
to seriously affect the accuracy of the solution for a given floating–point
precision. In ordinary numerical systems, the errors in A or b result from
the floating point representation of the numbers. The maximum relative
error in the floating point number is u. The condition number κ(A) is
the worst-case factor by which this floating–point error is magnified in the
solution. Thus, the relative error in the solution x is bounded from above
by the quantity O(uκ(A)). Therefore, if κ(A) ∼ u1 , then the relative error
in the solution can approach unity, which means the result is meaningless.
If κ(A) ∼ 10^{−r}/u, then the relative error in the solution can be taken as 10^{−r}, and the solution is approximately correct to r decimal places.
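A small numerical illustration (a sketch; the nearly singular matrix and the perturbation are my own choices):

import numpy as np

A = np.array([[1.0, 1.0], [1.0, 1.0 + 1e-8]])      # nearly singular, so kappa(A) is large
b = np.array([2.0, 2.0])

kappa = np.linalg.cond(A, 2)                        # sigma_1 / sigma_n
x = np.linalg.solve(A, b)
x_pert = np.linalg.solve(A, b + np.array([0.0, 1e-10]))

rel_err = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
print(kappa, rel_err)        # a tiny relative change in b is magnified by up to kappa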
Questions:
[Figure 5.2: plot of the two lines of the system in the (x(1), x(2)) plane, with arrows showing the directions in which b(1) and b(2) are perturbed; image not reproduced. See the caption below.]
Figure 5.2. A poorly conditioned system of equations. The value for b is [1, 1]T , which
is along the same direction as u1 , whereas the perturbations δb in b are [−0.01, 0.01]T ,
which is along the direction of u2 . Visual inspection shows that this arrangement results
in a relatively large shift in the point of intersection of the two lines, which corresponds
to the solution of the system of equations.
The two equations are shown plotted together in Fig. 5.2, where it may
be seen they are close to being co-linear. The singular values of A are
[2.0016, 0.0799], to give a value of κ(A) = 25.0401. The solution to the
unperturbed system is x = [0.5, 0.5]. The U –matrix from the SVD of A is
given as
U = [ −0.7060  −0.7082 ;  −0.7082  0.7060 ].
We now perturb the b–vector in (5.45) in the direction corresponding to the
worst–case error in x, which is along the u2 –axis. The perturbed value of
b is given as bp = b + 0.01u2 , where the value 0.01 was chosen to be the
magnitude of the perturbation. The resulting directions in which b(1) and
b(2) are perturbed are indicated in Fig. 5.2. We note that the unperturbed b
already points along the u1 –direction, which is the direction corresponding
to the worst–case error in the solution.
We now solve the perturbed system and compare the corresponding relative
error in the solution with the upper bound given by (5.44), repeated here
for convenience:
||δx||2 / ||x||2  ≤  (σ1/σn) (||δb||2 / ||b||2).
The value of ||δb||2 / ||b||2 on the right, using the current values, is 0.0070711.
κ(A) = 25.0401, so the worst–case relative error in x predicted by (5.44) is
0.17706. The perturbed solution is x = [0.40807, 0.58485]^T, which results
in an actual relative error of 0.17692, which is seen to be very close to the
worst–case upper bound predicted by (5.44), as it should be in this case.
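The experiment above is easy to reproduce numerically. The following is a minimal NumPy sketch (the matrix of (5.45) is not repeated here, so a representative poorly conditioned 2 × 2 matrix is used; the values and seed are purely illustrative):

import numpy as np

A = np.array([[1.00, 0.96],
              [0.96, 1.00]])            # nearly dependent rows: poor conditioning
U, s, Vt = np.linalg.svd(A)
kappa = s[0] / s[-1]                    # condition number kappa_2(A)

b  = U[:, 0]                            # b along u1 (worst direction for b)
db = 0.01 * U[:, -1]                    # perturbation along u_n (worst for delta b)

x  = np.linalg.solve(A, b)
dx = np.linalg.solve(A, b + db) - x

rel_err   = np.linalg.norm(dx) / np.linalg.norm(x)
rel_bound = kappa * np.linalg.norm(db) / np.linalg.norm(b)
print(kappa, rel_err, rel_bound)        # rel_err essentially attains the bound (5.44)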
denote the respective condition number as κ(B n−1 ). Now we add a column
and row to form B n (in such a way so that B n remains symmetric). The
largest and smallest eigenvalues of B n are now λ1 (B n ) and λn (B n ). We
can infer from the interlacing theorem that
λ1 (B n ) ≥ λ1 (B n−1 ), and
λn (B n ) ≤ λn−1 (B n−1 )
where we have set the value of r above equal to n − 1 in each case. These
equations imply that κ(B n ) ≥ κ(B n−1 ). This means that when the size of a
square symmetric matrix is increased, the condition number does not decrease;
only under special conditions does it remain unchanged.
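A quick numerical illustration of this bordering argument is sketched below in NumPy. To keep B n symmetric positive definite (an assumption beyond the symmetry required above), the matrix is grown as a Gram matrix X^T X, whose leading principal submatrices are exactly the earlier B n:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
for n in range(2, 8):
    B = X.T @ X                               # symmetric positive definite B_n
    print(n, np.linalg.cond(B))               # condition number never decreases with n
    X = np.hstack([X, rng.standard_normal((50, 1))])   # border B_n to form B_{n+1}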
Later chapters present methods for mitigating the effect of a poor condition number
when solving a system of equations.
Appendices
We now develop the idea of the condition number, which gives us a precise
definition of the sensitivity of x to changes in A or b in eq. (5.37). Now
consider the perturbed system
(A + εF)x(ε) = b + εf                                  (5.46)

where

ε is a small scalar,
F ∈ R^{n×n} and f ∈ R^n are errors,
x(ε) is the perturbed solution, such that x(0) = x.
We wish to place an upper bound on the relative error in x due to the pertur-
bations. Since A is nonsingular, we can differentiate (5.46) implicitly with
respect to ε:

(A + εF)ẋ(ε) + Fx(ε) = f                               (5.47)

For ε = 0 we get

ẋ(0) = A^{−1}(f − Fx).                                 (5.48)
The Taylor series expansion for x(ε) about ε = 0 has the form

x(ε) = x + ε ẋ(0) + O(ε²).

Hence, by taking norms, we have

||x(ε) − x|| ≤ ε ||A^{−1}|| ( ||f|| + ||F|| ||x|| ) + O(ε²),

where the triangle inequality has been used; i.e., ||a + b|| ≤ ||a|| + ||b||.
Using the property of p–norms, ||Ab|| ≤ ||A|| ||b||, together with ||b|| ≤ ||A|| ||x||,
and defining

ρA = ε ||F|| / ||A||, the relative error in A,
ρb = ε ||f|| / ||b||, the relative error in b,

we have

||x(ε) − x|| / ||x||  ≤  κ(A) (ρA + ρb) + O(ε²).        (5.52)
Thus we have the important result: Eq. (5.52) says that, to a first-order
approximation, the relative error in the computed solution x is bounded by
the expression κ(A) × (relative error in A + relative error in b). This is
a rather intuitively satisfying result. Thus the condition number κ(A) is
the maximum factor by which the relative errors in A and b are magnified to
give the relative error in the solution x.
λi = (v_i^T R v_i) / (v_i^T v_i).                       (5.54)

Expanding the numerator element by element gives

v_i^T R v_i = Σ_{k=1}^{n} Σ_{m=1}^{n} v_{ik} v_{im} r(k − m),      (5.55)

where v_{ik} denotes the kth element of the ith eigenvector v_i (the ith column of
the matrix V ), and r(k − m) is the (k, m)th element of R. Using the Wiener–Khintchine
relation² we may write

r(k − m) = (1/2π) ∫_{−π}^{π} S(ω) e^{jω(k−m)} dω,       (5.56)

where S(ω) is the power spectral density of the process. Substituting (5.56)
into (5.55) we have
v_i^T R v_i = (1/2π) Σ_{k=1}^{n} Σ_{m=1}^{n} v_{ik} v_{im} ∫_{−π}^{π} S(ω) e^{jω(k−m)} dω

            = (1/2π) ∫_{−π}^{π} S(ω) ( Σ_{k=1}^{n} v_{ik} e^{jωk} ) ( Σ_{m=1}^{n} v_{im} e^{−jωm} ) dω.      (5.57)
² This relation states that the autocorrelation sequence r(·) and the power spectral
density S(ω) are a Fourier transform pair [8].
Let Smin and Smax be the absolute minimum and maximum values of S(ω)
respectively. Then it follows that
∫_{−π}^{π} |V_i(e^{jω})|² S(ω) dω ≥ S_min ∫_{−π}^{π} |V_i(e^{jω})|² dω          (5.62)

and

∫_{−π}^{π} |V_i(e^{jω})|² S(ω) dω ≤ S_max ∫_{−π}^{π} |V_i(e^{jω})|² dω.         (5.63)
Hence, from (5.61) we can say that the eigenvalues λi are bounded by the
maximum and minimum values of the spectrum S(ω); i.e., S_min ≤ λi ≤ S_max.
It then follows that the condition number of R satisfies

κ(R) ≤ S_max / S_min.                                   (5.65)
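The bound (5.65) can be checked numerically. The sketch below (NumPy, with an assumed AR(1)-type autocorrelation r(k) = a^{|k|}, an illustrative choice not taken from the text) builds the Toeplitz autocorrelation matrix and compares its eigenvalues and condition number with the extrema of the corresponding power spectral density:

import numpy as np

a, n = 0.9, 50
k = np.arange(n)
R = a ** np.abs(np.subtract.outer(k, k))        # Toeplitz autocorrelation matrix, r(k) = a^|k|

w = np.linspace(-np.pi, np.pi, 4096)
S = (1 - a**2) / (1 - 2*a*np.cos(w) + a**2)     # corresponding power spectral density

lam = np.linalg.eigvalsh(R)
print(S.min(), lam.min(), lam.max(), S.max())   # S_min < lambda_i < S_max
print(lam.max() / lam.min(), S.max() / S.min()) # kappa(R) <= S_max / S_min, as in (5.65)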
5.9 Problems
5. A white noise process is fed through a low–pass filter with cutoff fre-
quency of fo and a monotonic rolloff characteristic. The process is
sampled in accordance with Nyquist’s criterion at a frequency fs only
slightly larger than 2fo . The covariance matrix R1 of this process is
evaluated. Then, the sampling frequency is increased well above the
value 2fo and the resulting covariance matrix R2 is again evaluated.
Compare the condition number of R1 with that of R2 and explain
your reasoning carefully. Hint: Consider Sect. 5.8.
6. Give bases for the row and column subspaces of A in terms of its L
and U factors.
7. On the course website, you will find a .mat file named Ch5Prob7.mat.
It contains a very poorly conditioned matrix A and a vector b.
Chapter 6
The QR Decomposition
In this chapter we consider the decomposition of a matrix A ∈ R^{m×n}, m ≥ n, into
the product A = QR, where Q ∈ R^{m×m} is orthonormal and R ∈ R^{m×n} is upper
triangular.
For m ≥ n, we can partition the QR decomposition in the following manner:

A = [ Q1  Q2 ] [ R1
                 0  ] ,                                 (6.1)

where Q1 ∈ R^{m×n}, Q2 ∈ R^{m×(m−n)}, R1 ∈ R^{n×n} is upper triangular, and the
zero block is (m − n) × n.
This follows from the fact that since R is upper triangular, the column
ak is a linear combination of the columns [q 1 , . . . , q k ] for k = 1, . . . , n.
R(A) = R(Q1 ),    R(A)^⊥ = R(Q2 ),

and, since R1 is nonsingular when A is full rank,

A R1^{−1} = Q1 .                                        (6.3)
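These relations are easy to verify numerically. The following NumPy sketch (with a random A, purely for illustration) forms the full QR factor and checks (6.1) and (6.3):

import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 3
A = rng.standard_normal((m, n))

Q, R = np.linalg.qr(A, mode="complete")        # Q is m x m, R is m x n
Q1, Q2, R1 = Q[:, :n], Q[:, n:], R[:n, :]

print(np.allclose(A, Q1 @ R1))                 # A = Q1 R1, so R(A) = R(Q1)
print(np.allclose(Q2.T @ A, 0))                # columns of Q2 span R(A)-perp
print(np.allclose(A @ np.linalg.inv(R1), Q1))  # (6.3): A R1^{-1} = Q1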
q2 ⊥ q1. (6.6)
From Fig. 1, we may satisfy (6.4)–(6.6) by considering a vector p2 , which is
the projection of a2 onto the orthogonal complement subspace of q 1 . The
vector p2 is thus defined as
p2 = P2^⊥ a2 = (I − q 1 q 1^T ) a2 .
Then, the vector q 2 is determined by normalizing p2 to unit 2–norm:

q 2 = p2 / ||p2 ||2 .
We define the matrix Q^{(k)} ≜ [q 1 , . . . , q k ]. Then the second column r 2 of R1
contains the coefficients of a2 relative to the basis Q^{(2)} . Thus,

r 2 = (Q^{(2)})^T a2 .
and

q k = pk / ||pk ||2 .

We now define Q^{(k)} ≜ [Q^{(k−1)} , q k ]. The column r k is then defined as

r k = (Q^{(k)})^T ak .
6.1.1 Modified G-S Method for QR Decomposition
We are given a matrix A ∈ Rm×n , m > n. In this case, (as with classical
Gram-Schmidt) the matrix Q which is obtained contains only n columns.
where matlab notation has been used. We see that the first column on
the right in (6.7) is now completely determined. We can proceed to the
second stage of the algorithm by forming a matrix B by subtracting this
first column from both sides of (6.7):
B (1) = A − q 1 r T1 .
Since a1 = r11 q 1 , the first column of B (1) is zero. We then have from (6.7):
b2 = r22 q 2
b3 = r23 q 2 + r33 q 3
 ⋮                                                      (6.8)
bn = r2n q 2 + r3n q 3 + · · · + rnn q n
From (6.8) it is evident that the column q 2 and row r T2 may be formed from
B (1) in exactly the same manner as q 1 and r T1 were from A. The method
proceeds n steps in this way until completion.
To formalize the process, assume we are at the kth stage of the decomposi-
tion. At this stage we determine the kth column q k of Q and the kth row r k^T of
R. We define the matrix A^{(k)} to be the portion of A that remains after the first
k − 1 rank–one terms q j r j^T have been subtracted out, with the (already zero) first
k − 1 columns deleted. We partition A^{(k)} as

A^{(k)} = [ z   B^{(k)} ] ,

where z ∈ R^m is a single column and B^{(k)} has n − k columns.
This situation above corresponds to having just subtracted out the (k − 1)th
column in (6.7). Then,
rkk = ||z||2
and

q k = z / rkk .
The kth row of R may then be calculated as:
This method, unlike the classical Gram Schmidt, is very stable. The nu-
merical stability results from the fact that errors in q k at the k th stage are
not compounded into succeeding stages. It also requires the same number
of flops as classical G-S. It may therefore be observed that modified G-S
is a very attractive method for computing the QR decomposition, since it
has excellent stability properties, coupled with relatively few flops for its
computation.
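As a concrete illustration of the procedure just described, here is a short NumPy sketch of modified Gram–Schmidt (the function name mgs_qr and the random test matrix are illustrative assumptions, not the book's notation):

import numpy as np

def mgs_qr(A):
    """Modified Gram-Schmidt QR of A (m x n, m >= n, full rank)."""
    A = A.astype(float).copy()
    m, n = A.shape
    Q, R = np.zeros((m, n)), np.zeros((n, n))
    for k in range(n):
        R[k, k] = np.linalg.norm(A[:, k])           # r_kk = ||z||_2
        Q[:, k] = A[:, k] / R[k, k]                 # q_k = z / r_kk
        R[k, k+1:] = Q[:, k] @ A[:, k+1:]           # kth row of R
        A[:, k+1:] -= np.outer(Q[:, k], R[k, k+1:]) # subtract q_k component immediately
    return Q, R

A = np.random.default_rng(0).standard_normal((5, 3))
Q, R = mgs_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(3)))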
x⊥ = (I − P )x
H = I − 2P (6.9)
The H matrices may be used to zero out selected components of a vector.
For example, by choosing the vector v in the appropriate fashion, all ele-
ments of a vector x may be zeroed, except the first, x1 . This is done by
choosing v so that the reflection of x in span(v)⊥ lines up with the x1 –axis.
Thus, in this manner, all elements of x are eliminated except the first.
A1 = H 1 A has zeros below the main diagonal in the first column. We then seek a
second Householder matrix H 2 such that A2 = H 2 A1
has zeros below the main diagonal in both the first and second columns.
This may be done by designing H 2 so that the first column of H 2 A1 is the
same as that of A1 , and so that the second column of H 2 A1 is zero below
the main diagonal.
R ≜ An−1 = H n−1 · · · H 1 A                            (6.10)
Let us now consider the first stage of the Householder process. Extension
to other stages is done later. How do we choose P (or more specifically v)
so that y = (I − 2P )x has zeros in every position except the first, for any
x ∈ Rn ? That is, how do we define v so that y = Hx is a multiple of e1 ?
Here goes:

Hx = (I − 2P )x
   = (I − 2v(v^T v)^{−1} v^T ) x
   = x − (2 v^T x / v^T v) v.                           (6.11)
Householder made the observation that if v is to reflect the vector x onto
the e1 -axis, then v must be in the same plane as that defined by [x, e1 ], or
in other words, v ∈ span(x, e1 ). Accordingly, we set v = x + αe1 , where α
is a scalar to be determined. At this stage, this assignment may appear to be
rather arbitrary, but as we see later, it leads to a simple and elegant result.
Substituting this definition for v into (6.11), where x1 is the first element of
x, we get
v T x = xT x + αx1
v T v = xT x + 2αx1 + α2 .
Thus,

Hx = x − (2 v^T x / v^T v) [x + αe1 ]
   = [ 1 − 2(x^T x + αx1 )/(x^T x + 2αx1 + α²) ] x − 2α (v^T x / v^T v) e1 .     (6.12)
To make Hx have zeros everywhere except in the first component, the first
term above is forced to zero. If we set α = ||x||2 , then the first term is:
1 − 2( ||x||2² + ||x||2 x1 ) / ( ||x||2² + 2||x||2 x1 + ||x||2² ) = 0,

and hence

Hx = −||x||2 e1 .                                       (6.13)
Note that we could also have achieved the same effect by setting α = −||x||2
in (6.12). The choice of sign of α affects the numerical stability of the
algorithm. If x is close to a multiple of e1 , then v = x − sign(x1 )||x||2 e1
has small norm; hence large relative error can exist in the factor β ≜ 2/(v^T v).
This difficulty can be avoided if the sign of α is chosen as the sign of x1 (the
first component of x); i.e.²,

v = x + sign(x1 )||x||2 e1 .                            (6.14)
The corresponding matrix H is given from the second line of (6.11) as

H = I − 2 v v^T / (v^T v).                              (6.15)
What is H such that the only non–zero element of Hx is in the first posi-
tion? That is, Hx ∈ span {e1 }. The process is very simple.
For example, take x = [1, 1, 1]^T , so that ||x||2 = √3 and v = x + √3 e1 = [1 + √3, 1, 1]^T .
Then β = 2/(v^T v) = 0.21132, and

H = I − 0.21132 [ 1 + √3  1  1 ]^T [ 1 + √3  1  1 ]

  = [ −0.57734  −0.57734  −0.57734
      −0.57734   0.78868  −0.21132
      −0.57734  −0.21132   0.78868 ]
² sign(x) = +1 if x is positive, and −1 if x is negative.
We see that

Hx = [ −1.73202  0  0 ]^T ,
which is exactly the way it is supposed to be. Note from this example, Hx
has the same 2–norm as x. This is a consequence of (6.13), which itself
follows from the orthonormality of H.
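The construction (6.14)–(6.15) and the 3 × 3 example above can be reproduced with a few lines of NumPy; the helper householder below is a hedged sketch, not a library routine:

import numpy as np

def householder(x):
    """H such that Hx is a multiple of e1, using the stable sign choice of (6.14)."""
    x = np.asarray(x, dtype=float)
    v = x.copy()
    v[0] += np.sign(x[0]) * np.linalg.norm(x)       # v = x + sign(x1) ||x||_2 e1
    beta = 2.0 / (v @ v)                            # beta = 2 / (v^T v)
    return np.eye(len(x)) - beta * np.outer(v, v)   # H = I - beta v v^T

x = np.array([1.0, 1.0, 1.0])
H = householder(x)
print(np.round(H, 5))          # matches the 3 x 3 matrix computed above
print(np.round(H @ x, 5))      # approximately -sqrt(3) e1, as in the example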
[1]
1 < k < j ≤ n, x ∈ Rn .
where H = I − 2vv^T /(v^T v) is the Householder matrix formed from the non-trivial
portion of v, then we have in this case

Hx = [ x1 , . . . , xk−1 , −sign(xk )α, 0, . . . , 0, xj+1 , . . . , xn ]^T ,

where the elements x1 , . . . , xk−1 and xj+1 , . . . , xn are unchanged, and the zeros
appear in the desired positions.
Let β = 2/(v^T v), and let v̂ and β̂ be the computed versions of v and β respectively.
Then

Ĥ = I − β̂ v̂ v̂^T .

The matrix HA has a block of zeros in a desired location. The floating–point
matrix fl[ĤA] satisfies

fl[ĤA] = H(A + E),

where

||E||2 ≤ c p² u ||A||2 ,
c is a constant of order 1,
p is the number of elements which are zeroed.
6.3 The QR Method for Computing the Eigendecomposition
Au = λu.
Thus we see that C has the same eigenvalues as A, and the eigenvectors u
of C are transformed versions of those of A.
A(0) = Q(0)R(0),    A(1) = R(0)Q(0).

We then factor A(1) = Q(1)R(1),
and again reverse the factors to form A(2) = R(1)Q(1), and continue iterating in this
fashion. The eigenvalues at each stage are identical to those of A(0). At
each successive stage of this process, A(k) becomes increasingly diagonal,
and eventually, A(k) becomes completely diagonal for large enough k, thus
revealing the eigenvalues. An extensive discussion on the convergence char-
acteristics of the QR procedure is presented in Golub and Van Loan [1].
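A bare-bones version of the unshifted QR iteration just described is sketched below in NumPy, applied to the symmetric Toeplitz matrix used later in (6.17); the iteration count is an arbitrary illustrative choice:

import numpy as np

def qr_iteration(A, n_iter=200):
    """Unshifted QR iteration: factor, reverse the factors, repeat."""
    Ak = A.copy()
    for _ in range(n_iter):
        Q, R = np.linalg.qr(Ak)    # A(k) = Q(k) R(k)
        Ak = R @ Q                 # A(k+1) = R(k) Q(k): same eigenvalues
    return Ak

A = np.array([[4., 3., 2., 1.],
              [3., 4., 3., 2.],
              [2., 3., 4., 3.],
              [1., 2., 3., 4.]])
print(np.round(np.diag(qr_iteration(A)), 4))       # approaches the eigenvalues of A
print(np.round(np.linalg.eigvalsh(A)[::-1], 4))    # reference values, largest first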
A square symmetric matrix can be converted to tridiagonal form with one
similarity transform. ( Tridiagonal means that only the main and first upper
and lower diagonals are non–zero. ) The process is quite straightforward.
The original matrix A is replaced with A(0) as follows:

A(0) = Qo A Qo^T .                                      (6.16)

One might well ask “If we can tridiagonalize A with one similarity transform,
then why can’t we completely diagonalize A in one step?” The answer lies in
the fact that we can indeed find a Qo so that Qo A has zeros in all positions
below the main diagonal. The problem is that the post–multiplication by
QTo as in (6.16) overwrites the zeros that result from the pre–multiplication.
As an example, we take the following Toeplitz matrix for A:
A = [ 4  3  2  1
      3  4  3  2
      2  3  4  3
      1  2  3  4 ]                                      (6.17)
which has zeros below the main diagonal in the first column as desired.
However, when we post–multiply by H^T , we get

HAH^T = [  9.7333  −1.0275  −1.9022  −2.0465
          −1.0275   0.8757   0.5318   0.4192
          −1.9022   0.5318   2.0977   1.8177
          −2.0465   0.4192   1.8177   3.2933 ] ,
and thus it is apparent that the zeros introduced in (6.18) have been over-
written by the later post–multiplication by H T and the procedure has not
accomplished our objective to introduce as many zeros as possible into A(0).
So now we accept the fact that the best we can do is to tridiagonalize. As
a first step in this respect, we formulate a Householder matrix H to wipe
out the elements below the first lower diagonal in the first column. This is
given by
H = [ 1.0000    0        0        0
      0      −0.8018  −0.5345  −0.2673
      0      −0.5345   0.8414  −0.0793
      0      −0.2673  −0.0793   0.9604 ] .
Notice that the selective elimination procedure as in Sect. 6.2.3 has been
used, since we wish to keep the element (1,1) intact. This explains the
identity structure in the first row and column of H. The second column
below the first lower diagonal is eliminated in a corresponding fashion. The
overall result of the tridiagonalization procedure is then given by
HAH^T = [  4.0000  −3.7417  −0.0000  −0.0000
          −3.7417   8.2857  −2.6030  −0.0000
          −0.0000  −2.6030   3.0396  −0.2254
          −0.0000  −0.0000  −0.2254   0.6747 ] .
We see that the only non–zero elements are indeed located along the three
main diagonals, as desired. After tridiagonalization, regular QR iterations
as described above are applied, but now only one element in each column
requires elimination, a fact that greatly speeds up both convergence and
execution of the algorithm. The tridiagonal structure is maintained at each
QR iteration.
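The first step of this tridiagonalization is easy to reproduce. The NumPy sketch below builds the Householder reflection from the part of the first column below the diagonal and applies it as a similarity transform; the resulting numbers match those quoted above:

import numpy as np

A = np.array([[4., 3., 2., 1.],
              [3., 4., 3., 2.],
              [2., 3., 4., 3.],
              [1., 2., 3., 4.]])

x = A[1:, 0]                                   # part of column 1 below the diagonal
v = x + np.sign(x[0]) * np.linalg.norm(x) * np.eye(len(x))[:, 0]
Hs = np.eye(len(x)) - 2.0 * np.outer(v, v) / (v @ v)

H = np.eye(4)
H[1:, 1:] = Hs                                 # identity structure in first row/column
print(np.round(H, 4))                          # matches the H quoted above
print(np.round(H @ A @ H.T, 4))                # first column becomes [4, -3.7417, 0, 0]^T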
below the first lower diagonal. This is the so–called upper Hessenberg form
of the matrix. The overall process requires somewhat more computation
than the symmetric case.
A(k) − αk I = Q(k)R(k)
A(k + 1) = R(k)Q(k) + αk I,

where αk is a shift chosen to accelerate convergence.
³ By cubic convergence, we mean that if the error at iteration k is εk for suitably large
k (where εk may be assumed small), then the error is o(εk³) at iteration k + 1. Thus the
convergence is very fast.
Appendices
In this Appendix, we examine two additional forms of QR decomposition
– the Givens rotation and fast Givens rotation methods. They are both
useful methods, particularly when only specific elements of A need to be
eliminated.
We have seen so far in this lecture that the QR decomposition may be ex-
ecuted by the Gram Schmidt and Householder procedures. We now discuss
the QR decomposition by Givens rotations. A Givens transformation (rotation)
is capable of annihilating (zeroing) a single element in any position of interest.
Givens rotations require a larger number of flops compared to Householder
to compute a complete QR decomposition on a matrix A. Nevertheless,
they are very useful in some circumstances, because they can be used to
eliminate only specific elements.
eliminated; thus,
J n,n−1 · · · J n2 · · · J 32 J n1 · · · J 21 A = R,

where the product of the J matrices constitutes Q^T and R is upper triangular.
Thus,

s/c = tan θ = aik /akk .

Notice that θ is not explicitly computed. The matrix J (i, k, θ) is now com-
pletely specified.
The following algorithm computes c and s in the most stable numerical
fashion and ensures that J (i, k)x has a 0 in the (i, k)th position:

If |xk | ≥ |xi |
    then  t := xi /xk ;   c := 1/(1 + t²)^{1/2} ;   s := ct
    else  t := xk /xi ;   s := 1/(1 + t²)^{1/2} ;   c := st
This algorithm assures that |t| ≤ 1. If |t| becomes large, we may run into
stability problems in calculating c and s.
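A direct transcription of this computation into NumPy is sketched below (the function name givens is mine); the example zeroes the second component of [3, 4]^T:

import numpy as np

def givens(xk, xi):
    """Return (c, s) so that [[c, s], [-s, c]] @ [xk, xi] has a zero second entry."""
    if xi == 0.0:
        return 1.0, 0.0
    if abs(xk) >= abs(xi):
        t = xi / xk
        c = 1.0 / np.sqrt(1.0 + t * t)
        s = c * t
    else:
        t = xk / xi
        s = 1.0 / np.sqrt(1.0 + t * t)
        c = s * t
    return c, s

c, s = givens(3.0, 4.0)
J = np.array([[c, s], [-s, c]])
print(J @ np.array([3.0, 4.0]))    # [5, 0]: the second element has been annihilated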
It is easily verified that the following facts hold true when evaluating the
product J (i, k)A:
where aTi , aTk are the ith and kth rows of A respectively. Thus, only the ith
and kth rows of the product J A are actually relevant in the Givens analysis.
The order in which elements are annihilated in the QR decomposition is
critical. The elements are eliminated column by column, working down each
column: the elements a21 , . . . , am1 are annihilated first, each by a rotation
with the first row (k = 1); then the elements a32 , . . . , am2 are annihilated
by rotations with the second row (k = 2); then a43 , . . . , am3 with the third
row, and so on. If this ordering is not followed, then previously written zeros
may be overwritten by non-zero values in later stages.
k = 1. This is done using Givens rotations in the following way. The rotation
J (i, 1) is an identity matrix except for the four elements J11 = Jii = c, J1i = s
and Ji1 = −s. Applying it to x gives

J (i, 1) x = [ x1^{(1)} , x2 , . . . , xi−1 , 0, xi+1 , . . . , xn ]^T ,

where the bracketed superscript indicates the corresponding element has
been changed once, and

c = x1 /√(x1² + xi²),    s = xi /√(x1² + xi²).
The difficulty with the original Givens method is that generally, none of the
elements of the J –matrix at the 2 × 2 level in (6.20) are 0 or 1. Thus, the
update of a given element from (6.20) involves 2 multiplications and one add
for each element in rows k and i. We now consider a faster form of Givens
where the off– diagonal elements of the transformation matrix are replaced
by ones. This reduces the number of explicit multiplications required for
the evaluation of each altered element of the product from two to one.
In this vein, let us consider Fast Givens: the idea here is to eliminate each
element of a using a simplified transformation matrix, denoted as M , to
reduce the number of flops required over ordinary Givens. The result is that
the M used for fast Givens is orthogonal but not orthonormal.
where A ∈ R^{m×n} , m > n, S ∈ R^{n×n} is upper triangular, and M ∈ R^{m×m}
has orthogonal but not orthonormal columns.
Hence

M^T M = D = diag(d1 , . . . , dm ),                     (6.22)

and M D^{−1/2} is orthonormal.
We deal with the fast Givens problem at the 2 × 2 level. Let x = [x1 x2 ]T ,
and we define the matrix M 1 as

M 1 = [ β1  1
        1   α1 ] .                                      (6.23)
At the m × m level, the matrix M 1 has a form analogous to slow Givens:
M 1 (i, k) is the identity matrix with the 2 × 2 block of (6.23) embedded in rows
and columns k and i (ones elsewhere on the diagonal, zeros elsewhere off the
diagonal).                                              (6.26)
For the sake of interest, let us see how the fast Givens decomposition may
be used to solve the LS problem. From (6.21) and (6.22), and using the fact
that M D^{−1/2} is orthonormal, we can write

||Ax − b||2 = || D^{−1/2} M^T A x − D^{−1/2} M^T b ||2

            = || D^{−1/2} ( [ S      [ c
                              0 ] x −  d ] ) ||2        (6.27)
where

M^T b = [ c
          d ] ,    c ∈ R^n , d ∈ R^{m−n} .
The great advantage to the fast Givens approach is that the triangularization
may be accomplished using half the number of multiplications compared to
slow Givens, and may be done without square roots, which is good for VLSI
implementations.
The following table presents a flop count for various methods of QR decom-
position of a matrix A ∈ Rm×n :
Householder:            2n²(m − n/3)
slow Givens:            3n²(m − n/3)
fast Givens:            2n²(m − n/3)
Gram–Schmidt:           2mn²
by comparison, Gauss:   (2/3)n³

(1 flop = 1 floating–point operation: add, multiply, divide or subtract.)
6.6 Problems
[ A1  A2 ] = [ Q1  Q2 ] [ R11  R12
                          0    R22 ]

where A1 ∈ R^{m×k} , A2 ∈ R^{m×(n−k)} , Q1 ∈ R^{m×k} , Q2 ∈ R^{m×(m−k)} ,
R11 ∈ R^{k×k} , R12 ∈ R^{k×(n−k)} and R22 ∈ R^{(m−k)×(n−k)} .
5. (From Strang): Show that for any two different vectors x and y of the
same length, the choice v = x − y leads to a Householder transforma-
tion such that Hx = y and Hy = x.
6. Updating the QR decomposition with time: At a certain time t, we
have available m row vectors aTi ∈ Rn and their corresponding desired
values bi , for i = 1, . . . , m, to form the matrix At ∈ Rm×n and bt ∈ Rm .
The QR decomposition QTt At = Rt is available at time t to aid in the
computation of the LS problem minx ||Ax − b||. At time t + 1 a new
(m + 1)th row aTm+1 of A and a new (m + 1)th element of b become
available. Explain in detail how to update Qt and Rt to get Qt+1 and
Rt+1 . Hint: At+1 can be decomposed as
At+1 = [ Qt    v    [ Rt
         z^T   1 ]    a^T_{m+1} ]
Chapter 7
We start off with a quick look at a few applications of least squares, and go
on to develop the LS model. We then develop the so-called normal equations
for solving the LS problem. We discuss several statistical properties of the
LS solution including the Cramer–Rao lower bound (CRLB). We look at
the performance of the LS estimates relative to the CRLB in the presence
of white and coloured noise. We show that in the coloured noise case, per-
formance is degraded, and so we consider various methods for whitening the
noise, which restore the performance of the LS estimator.
Because least squares is such an important topic, it is the focus of the next
four chapters. In Chapter 8, we discuss LS estimation when the matrix
A is poorly conditioned or rank deficient. Then we extend this treatment
in Chapter 9 to discuss latent variable methods, which are useful for mod-
elling poorly conditioned linear systems. Specifically, we deal with the case
where we wish to predict system responses given new input values. Then
in Chapter 10, we discuss the important concept of regularization, which is
an additional method for mitigating the effects of poor conditioning when
modelling linear systems.
¹ Here and in the sequel, for simplicity of notation, we use subscript notation to imply
the quantity x(iT ).
Figure 7.1. A block diagram of an equalizer in a communications system.
far as possible. Thus ideally, the symbols zi at the output of the equalizer
are equal to the corresponding transmitted symbols yi plus noise. For more
details on this topic, there are several good references on equalizers and
digital communications systems at large, e.g., [16].
di = zi + ei
   = Σ_{k=1}^{n} ak x(i − k) + ei .                     (7.1)
where in the last line we have made use of the fact that the sequence z[n]
is the convolution of the input sequence x[n] with the sequence a[n], as is
evident from Figure 7.1. If we observe (7.1) over m sample periods we obtain
a new equation in the form of (7.1) for every value of the index i = 1, . . . , m,
where m > n. We can combine these resulting m equations into a single
matrix equation:
d = X a + e,                                            (7.2)

where d ∈ R^{m×1} , X ∈ R^{m×n} , a ∈ R^{n×1} and e ∈ R^{m×1} .
an excellent example of a time varying AR process, where the vocal tract acts
as a time–varying, highly resonant all–pole acoustic filter whose input is a
pulse train generated by the vocal cords, or a white noise sequence generated
by a restriction somewhere in the vocal tract. During the production of a
single phoneme, which represents an interval of about 20 msec, the voice
may be considered approximately stationary and is therefore amenable to
AR modelling. At an 8 KHz sampling rate, which is a typical value in
telephone systems, there are 160 samples in this 20 msec interval, and an
AR model representing this sequence typically consists of about 10 ∼ 15
parameters. Therefore the sequence of 160 samples can be compressed into
this range of parameters by AR modelling, thus achieving a significant degree
of compression. The AR model must be updated in roughly 20 msec intervals
to track the variation in phoneme production of the voice signal.
Let W (z) and Y (z) denote the z-transforms of the input and output se-
quences, respectively. Then
H(z) = Y (z)/W (z) = 1 / ( 1 − Σ_{i=1}^{n} hi z^{−i} )

or

Y (z) [ 1 − Σ_{i=1}^{n} hi z^{−i} ] = W (z).
We note the expression on the left is a product of z–transforms, so the corre-
sponding time–domain expression involves the convolution of the sequence
[1, −h1 , −h2 , . . . , −hn ] with [y1 , y2 , . . .]. The equivalent of the above expression in the
time domain is therefore

yi − Σ_{k=1}^{n} hk yi−k = wi
or

yi = Σ_{k=1}^{n} hk yi−k + wi ,                         (7.5)
where the variance of the sequence w[i] is σ 2 . From (7.5), we see that the
output of an all-pole filter when driven by white noise may be given the
interpretation that the present value of the output is a linear combination
of past outputs weighted by the denominator coefficients, plus a random
disturbance. The closer the poles of the filter are to the unit circle, the
more resonant is the filter, and the more predictable is the present output
from its past values.
y =Yh+w (7.6)
Eq. (7.6) is of the same form as (7.2). So again, it makes sense to choose
the h’s in (7.6) so that the predicting term Y h is as close as possible to the
true values y in the 2-norm sense. Hence, as before, we choose the optimal
h0 as the solution to
Notice that if the parameters h and the variance σ 2 are known, the autore-
gressive process is completely characterized.
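A minimal NumPy sketch of this AR/linear–prediction setup is given below; the helper ar_fit and the synthetic AR(2) test signal are illustrative assumptions, not the book's data:

import numpy as np

def ar_fit(y, n):
    """Fit y_i = sum_k h_k y_{i-k} + w_i (as in (7.5)-(7.6)) by least squares."""
    m = len(y)
    Y = np.column_stack([y[n - k: m - k] for k in range(1, n + 1)])  # past samples
    rhs = y[n:]                                                      # present samples
    h, *_ = np.linalg.lstsq(Y, rhs, rcond=None)
    sigma2 = np.mean((rhs - Y @ h) ** 2)        # estimated variance of the driving noise w
    return h, sigma2

# synthetic test: a resonant AR(2) process driven by unit-variance white noise
rng = np.random.default_rng(0)
y = np.zeros(5000)
for i in range(2, len(y)):
    y[i] = 1.6 * y[i - 1] - 0.9 * y[i - 2] + rng.standard_normal()
print(ar_fit(y, 2))       # h close to [1.6, -0.9], sigma^2 close to 1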
of years. We also have information whether the respective observation led to
a hurricane developing or not (let’s say +1 for developed, -1 for not). Here,
the set of conditions for the ith observation, i = 1, . . . m forms a row of a
matrix A, and bi is the corresponding response [±1]. We can formulate this
situation into a linear mathematical model for this problem as follows:
b = Ax + e (7.8)
where e is the error between the linear model Ax and the observations b.
We can expand each of the matrix/vector quantities for clarity as follows:
b = [ ±1     [ a11  a12  · · ·  a1n   [ x1
      ±1       a21  a22  · · ·  a2n     x2
      ⋮    =    ⋮    ⋮    ⋱     ⋮        ⋮    + e,
      ±1 ]     am1  am2  · · ·  amn ]   xn ]
where the aij elements are the jth variable of the ith observation. As in the
previous examples, we wish to determine a set x of predictors (or weights
for each of the variables), which give the best fit between the model Ax and
the observation b. The predictors x⋆ may be determined as the solution to the LS
problem min_x ||Ax − b||2². The response b̂ to a new observation a_new (a new row
of conditions not included in A) may then be predicted as

b̂ = a_new^T x⋆ ,
where “hat” denotes an estimated value. Typically the value of b̂ will not
be ±1 as it would be in the ideal case. But in practice a good prediction
could be made by declaring a hurricane will develop if b̂ ≥ T , and not
develop otherwise, where T is some suitably–chosen threshold such as, e.g.,
the value zero.
It is now apparent that these examples all have the same mathematical
structure. Let us now provide a standardized notation. We define our
regression model , corresponding to (7.2), (7.6) or (7.8) as:
b = Ax + n (7.9)
Figure 7.2. A geometric interpretation of the LS problem for the one-dimensional case.
The sum of the squared vertical distances from the observed points to the line b = xa is
to be minimized with respect to the variable (slope) x.
We define the minimum sum of squares of the residual ||AxLS − b||22
as ρ2LS .
If r = rank(A) < n, then there is no unique xLS which minimizes
||Ax − b||2 . However, the solution can be made unique by considering
only that element of the set { xLS ∈ R^n | ||AxLS − b||2 = min } which
itself has minimum norm.
Let us define the quantity c ≜ A^T b. This implies that the component ck of
c is a k^T b, k = 1, . . . , n, where a k^T is the transpose of the kth column of A.
Thus t2 (x) = −x^T c. Therefore,

d/dxk t2 (x) = d/dxk (−x^T c) = −ck = −a k^T b,   k = 1, . . . , n.      (7.15)

Combining these results for k = 1, . . . , n back into a column vector, we get

d/dx t2 (x) = d/dx (−x^T A^T b) = −A^T b.                               (7.16)

Since Term 3 of (7.14) is the transpose of Term 2 and both are scalars, the
terms are equal. Hence,

d/dx t3 (x) = −A^T b.                                                   (7.17)
7.2.1 Interpretation of the Normal Equations
AT (b − AxLS ) = 0 (7.21)
or
AT r LS = 0 (7.22)
where
r LS ≜ b − AxLS                                         (7.23)
is the least–squares residual vector between AxLS and b. Thus, r LS must
be orthogonal to R(A) for the LS solution, xLS . Hence, the name “normal
equations”. This fact gives an important interpretation to least-squares
estimation, which we now illustrate for the 3 × 2 case. Eq. (7.9) may be
expressed as
b = [ a1  a2 ] [ x1
                 x2 ] + n.
The above vector relation is illustrated in Fig. 7.3. We see from (7.22) that
the point AxLS is at the foot of a perpendicular dropped from b into R(A).
The solution xLS are the coefficients of the linear combination of columns
of A which equal the “foot vector”, AxLS .
where P is the projector onto R(A). Thus, we see from another point of view
that the least-squares solution is the result of projecting b (the observation)
onto R(A).
It is seen from (7.9) that in the noise-free case, the vector b is equal to the
vector AxLS . The fact that AxLS should be at the foot of a perpendicular
from b into R(A) makes intuitive sense, because a perpendicular is the
shortest distance from b into R(A). This, after all, is the objective of the
LS problem as expressed by eq. (7.10).
Figure 7.3. A geometric interpretation of the LS problem for the 3 × 2 case. The red
cross-hatched region represents a portion of R(A). According to (7.21), the point AxLS
is at the foot of a perpendicular dropped from b into R(A).
There is a further point we wish to address in the interpretation of the
normal equations. Substituting (7.25) into (7.23) we have
r LS = b − A(AT A)−1 AT b
= (I − P )b
= P ⊥ b. (7.26)
We can now determine the value ρ2LS , which is the squared 2–norm of the
LS residual:
ρ²LS ≜ ||r LS ||2² = ||P^⊥ b||2² .                      (7.27)
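The orthogonality property (7.22) and the projection interpretation (7.26)–(7.27) can be checked directly; the following NumPy sketch uses a random A and b purely for illustration:

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)

x_ls = np.linalg.solve(A.T @ A, A.T @ b)       # normal equations (7.19)
r_ls = b - A @ x_ls                            # residual (7.23)

P = A @ np.linalg.solve(A.T @ A, A.T)          # projector onto R(A)
print(np.allclose(A.T @ r_ls, 0))              # (7.22): A^T r_LS = 0
print(np.allclose(r_ls, (np.eye(6) - P) @ b))  # (7.26): r_LS = P-perp b
print(np.isclose(r_ls @ r_ls, b @ (np.eye(6) - P) @ b))   # (7.27)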
We naturally want to reduce the error in the parameter estimates as far
as possible, but to do this, we need to quantify the error itself. Two such
measures for this purpose are bias and covariance. The bias of an estimated
parameter vector θ is defined as E(θ̂ − θ o ), where the expectation is taken
over all possible values of the parameter estimate θ̂, and θ o is the true value
of the parameter. The covariance matrix of the parameter estimate is given
as

cov(θ̂) = E[ (θ̂ − E(θ̂)) (θ̂ − E(θ̂))^T ].              (7.29)
E(xLS ) = xo ,                                          (7.32)
which follows because n is zero mean from assumption A1. Therefore the
expectation of x is its true value, and xLS is unbiased.
7.3.2 Covariance Matrix of xLS
From (7.31) and (7.30), xLS − E(xLS ) = (A^T A)^{−1} A^T n. Substituting these
values into (7.29), we have
cov(xLS ) = E[ (A^T A)^{−1} A^T n n^T A (A^T A)^{−1} ].            (7.33)
From assumption A2, we can move the expectation operator inside. There-
fore,
cov(xLS ) = (A^T A)^{−1} A^T E[n n^T ] A (A^T A)^{−1}
          = (A^T A)^{−1} A^T (σ² I) A (A^T A)^{−1}
          = σ² (A^T A)^{−1} .                                      (7.34)
The geometry relating to LS variance is shown in Fig. 7.4 for the one–
dimensional case. Here, the normal equations (7.19) devolve into the form
xLS = (a^T b)/(a^T a), and the variance expression (7.34) for var(xLS ) becomes σ²/(a^T a).
(Here a is denoted in lower case since it is a vector, and x in unbolded format,
since it is a scalar). For part (a) of the figure, we see the range of a–values
is relatively compressed, in which case aT a is small, whereupon var(xLS ) is
large. This fact is evident from the figure, in that the slope estimates xLS
will vary considerably over different sample sets of the observations (bi , ai ),
having the same noise variance and the same spread ∆a. On the other hand,
we see from part (b) that, due to the larger spread ∆a in this case, the slope
estimate is more stable over different samples of observations with the same
noise variance.
which may be ascertained to be the expected value of b given xLS and the
new data aTN . The problem of predicting responses to new observations is
treated at length in Chapter 9.
The question arises, “How good is this estimate of b̂”? To address this issue,
we evaluate the variance of the prediction b̂. We define bo as bo = aTN xo ,
where xo = ExLS . The variance σb2 of b̂ is calculated as follows:
where we have used (7.35) in the second line. Since aTN are measured and
not random variables, the expectation operator can be moved to the inside
set of brackets. This expectation is given by (7.34). Therefore we can write
We note that the predicted value b̂ is also dependent on the quantity (AT A)−1 ,
and therefore as we have seen, one small eigenvalue can result in large vari-
ances in the estimate b̂. Various latent variable approaches for mitigating
this effect are discussed in Chapter 9.
Figure 7.4. The geometry of LS variances. In the top figure, the observations (dots) are
spread over a narrow range of a–values, giving rise to large variation in slope estimates
over different samples of the observations. In the lower figure, the spread of a–values
is larger, giving rise to lower variances in the slope estimates over different samples of
observations. The magnitude of the slope variation is depicted by the magnitude of the
arc’d arrows.
7.3.4 xLS is a BLUE (aka The Gauss–Markov Theorem)
x̃ = Bb (7.37)
E(x̃) = BAxo .
BA = I. (7.39)
x̃ = xo + Bn.
cov(x̃) = E[ B n n^T B^T ]
        = σ² B B^T .                                    (7.40)
We now consider a matrix Ψ defined as the difference of the estimator matrix
B and the least–squares estimator matrix (AT A)−1 AT :
Ψ = B − (AT A)−1 AT
We note that the diagonal elements of a covariance matrix are the vari-
ances of the individual elements. But from (7.40) and (7.34) we see that
σ 2 BB T and σ 2 (AT A)−1 are the covariance matrices of x̃ and xLS respec-
tively. Therefore, (7.42) tells us that the variances of the elements of x̃ are
never better than those of xLS . Thus, within the class of linear unbiased es-
timators, and under assumptions A1 and A2, no other estimator has smaller
variance than the L–S estimate.
We see later in this chapter that at least one small eigenvalue of the matrix
AT A can cause the variances of xLS to become large. This undesirable
situation can be mitigated by using the pseudo–inverse method discussed
in the following chapter. However, the pseudo–inverse introduces bias into
the estimate. In many cases, the overall error (a combination of bias and
variance) is considerably reduced with the pseudo–inverse approach. Thus,
the idea of an unbiased estimator is not always desirable and there may
be biased estimators which perform better on average than their unbiased
counterparts.
² The notation (·)ij means the (i, j)th element of the matrix argument.
7.4 Least Squares Estimation from a Probabilistic
Approach
b = Axo + n. (7.43)
Here we assume the more general case where the covariance of the noise is Σ.
Given the observation b, and if A and xo and Σ are known, then under the
current assumptions b is a Gaussian random variable with mean Axo and
covariance matrix Σ. (Recall this distribution is denoted as N (Axo , Σ)).
Since the multivariate Gaussian pdf is completely described in terms of its
mean and covariance, then
p(b|A, xo , Σ) = (2π)^{−n/2} |Σ|^{−1/2} exp[ −(1/2)(b − Axo )^T Σ^{−1} (b − Axo ) ]       (7.44)
We also investigate an additional pdf, which is that of xLS given all the
parameters. It is a fundamental property of Gaussian–distributed random
variables that any linear transformation of a Gaussian–distributed quantity
is also Gaussian. From (7.24) we see that xLS is a linear transformation of
b, which is Gaussian by hypothesis. Since we have seen that the mean of
xLS is xo and the covariance specifically for the white noise case from (7.34)
−1
is σ 2 AT A , then xLS has the Gaussian pdf given by
p(xLS |xo , A, σ) = (2π)^{−n/2} |σ^{−2} A^T A|^{1/2} exp[ −(1/(2σ²)) (xLS − xo )^T A^T A (xLS − xo ) ].
                                                        (7.45)
Recall from the discussion of Sect. 4.3 that the joint confidence region (JCR)
of xLS is defined as the locus of points ψ where the pdf has a constant value
with respect to variation in xLS . These JCR’s are elliptical in shape. The
probability level α of an observation falling within the JCR is the integral
of the interior of the ellipse. Since the variable xLS appears only in the
exponent, the set ψ is defined as the set of points xLS such that the quadratic
form in the exponent (and hence the pdf itself) is equal to a constant – that
is, the JCR ψ is defined as
ψ = { xLS | (1/(2σ²)) (xLS − xo )^T A^T A (xLS − xo ) = k }          (7.46)
where the value k is determined from the probability level α.
The quadratic form in the exponent may be written as

−(1/(2σ²)) z^T Λ z,

where z ≜ V^T (xLS − xo ) and V ΛV^T is the eigendecomposition of A^T A.
The length of the ith principal axis of the associated ellipse is then propor-
tional to 1/√λi .
the length of the corresponding axis is large, and z has large variance in the
direction of the corresponding eigenvector v i , as shown in Fig. 7.5. It may
be observed that if v i has significant components along any component of
xLS , then these components of xLS have large variances too. From Fig. 7.5,
it is seen that λ2 is smaller than λ1 , which causes large variation along the
v 2 –axis, which in turn causes large variances on both the x1 and x2 axes.
On the other hand, if all the eigenvalues are larger, then the variances of z,
and hence xLS , are lower in all directions. This situation is shown in Fig.
7.6, where it is seen in this case the eigenvalues are well–conditioned and all
relatively large. In this case, the variation along the co–ordinate axes has
been considerably reduced.
We see that the variances of both the x1 and x2 components of xLS are large
due to only one of the eigenvalues being small. Generalizing to multiple
dimensions, we see that if all components of xLS are to have small variance,
then all eigenvalues of AT A must be large. Thus, for desirable variance
properties of xLS , the matrix AT A must be well– conditioned; i.e., the
condition number of AT A, (as discussed in Sect. 5.4) should be as close to
unity as possible, and the eigenvalues be as large as possible. This is the
“sense” referred to earlier in which the matrix AT A must be “big” in order
for the variances to be small.
Figure 7.5. The blue ellipse represents a joint confidence region at some probability level
α, where the semi–axes have lengths proportional to 1/√λi as shown. The fact that λ2 is
relatively small in this case causes large variation along the v 2 –axis, which in turn causes
large variation along each of the co–ordinate axes, leading to large variances of the LS
estimates.
Figure 7.6. A joint confidence region similar to that in Fig. 7.5, but where the eigenvalues
are better conditioned and relatively large. In this case we see that the variation along
the co–ordinate axes is considerably reduced.
From the above, we see that one small eigenvalue has the ability to make
the variances of all components of xLS large. In the following chapters,
we present various methods for mitigating the effect of a small eigenvalue
destroying the desirable variance properties of xLS .
(J)_{ij} = −E[ ∂² ln p(b|A, Σ, x) / (∂xi ∂xj ) ].       (7.47)
where j ii denotes the (i, i)th element of J −1 . Because the diagonal elements
of a covariance matrix are the variances of the individual elements, (7.48)
tells us that the individual variances of the estimates x̃i obtained by some
arbitrary estimator are greater than or equal to the corresponding diagonal
term of J^{−1}. The CRLB thus puts a lower bound on how small the variances
can be, regardless of how good the estimation procedure is.
noise n is coloured, with covariance matrix Σ. Using the same analysis as in
Sect. 7.3.2, except replacing E(b − Axo )(b − Axo )T with Σ, the covariance
matrix of xLS becomes
Note that this covariance matrix is not equal to J −1 from (7.50). Therefore
the variances on the elements of xLS in this case are necessarily larger than
the minimum possible given by the bound3 . Therefore using the ordinary
normal equations when the noise is not white results in estimates with sub–
optimal variances.
Using the above as the regression model, and substituting G−1 A for A and
G−1 b for b in (7.19), we get:
cov(xLS ) = E[ (A^T Σ^{−1} A)^{−1} A^T Σ^{−1} (b − Axo )(b − Axo )^T Σ^{−1} A (A^T Σ^{−1} A)^{−1} ]
          = (A^T Σ^{−1} A)^{−1} A^T Σ^{−1} E[(b − Axo )(b − Axo )^T ] Σ^{−1} A (A^T Σ^{−1} A)^{−1}
          = (A^T Σ^{−1} A)^{−1} ,                       (7.55)

since E[(b − Axo )(b − Axo )^T ] = Σ.
Notice that in the coloured noise case when the noise is pre–whitened as in
(7.52), the resulting matrix cov(xLS ) is equivalent to J −1 in (7.50), which
is the corresponding form of the CRLB; i.e., the equality of the bound is
now satisfied.
³ However, it may be shown that xLS obtained in this way in coloured noise is at least
unbiased.
Hence, in the presence of coloured noise with a covariance matrix that is
either known or can be estimated, pre–whitening the noise before applying
the linear least–squares estimation procedure also results in a minimum
variance unbiased estimator of x. We have seen this is not the case when
the noise is not prewhitened.
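A small NumPy sketch of this pre–whitening procedure is given below; the AR(1)-type noise covariance and the true parameter vector are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(3)
m, n = 200, 3
A = rng.standard_normal((m, n))
x_o = np.array([1.0, -2.0, 0.5])

idx = np.arange(m)
Sigma = 0.9 ** np.abs(np.subtract.outer(idx, idx))   # coloured-noise covariance (assumed known)
G = np.linalg.cholesky(Sigma)                        # Sigma = G G^T
b = A @ x_o + G @ rng.standard_normal(m)             # b = A x_o + n, cov(n) = Sigma

Aw, bw = np.linalg.solve(G, A), np.linalg.solve(G, b)    # whitened quantities G^{-1}A, G^{-1}b
x_w   = np.linalg.solve(Aw.T @ Aw, Aw.T @ bw)            # ordinary LS on the whitened data
x_ols = np.linalg.solve(A.T @ A, A.T @ b)                # ordinary LS, ignoring the colouring
print(x_w, x_ols)   # both unbiased; x_w has covariance (A^T Sigma^{-1} A)^{-1}, as in (7.55)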
Note from (7.56) that the value x which maximizes the conditional proba-
bility p(b|x) is precisely xLS . This follows because xLS is by definition that
value of x which minimizes the quadratic form of the exponent in (7.56).
Thus, xLS is also the maximum likelihood estimate of x. Variances of max-
imum likelihood estimates asymptotically approach the Cramer–Rao lower
bound as the number of observations m → ∞. However, specifically for the
linear LS case, the variances satisfy the CRLB for finite m, as we have seen
from (7.55).
||x||22 = xT x.
We may write this in the form xT Ix. The squared Mahalanobis distance is
given by replacing the I with a full–rank, positive–definite matrix Σ−1 , to
get
||x||²_{Σ^{-1}} = x^T Σ^{−1} x.                         (7.57)
We have seen in Sect. 4.2 that the set of values { x | x^T Σ^{−1} x = 1 } (for
which the Mahalanobis distance is constant) is an ellipse. From this, we
may suspect that the Mahalanobis distance varies with the direction of x.
To confirm this idea, we may write (7.57) in the form

||x||²_{Σ^{-1}} = x^T V Λ^{−1} V^T x,
where V ΛV^T is the eigendecomposition of Σ. If we define z ≜ V^T x, then

||x||²_{Σ^{-1}} = z^T Λ^{−1} z = Σ_{i=1}^{n} zi² / λi ,
where z are the coefficients of x in the basis V . The squared Mahalanobis
distance may therefore be interpreted as measuring distances along the
eigenvector directions, in units of the respective eigenvalue. In the coloured
noise case the eigenvalues of Σ are not equal, so distances are measured
differently along each eigenvector direction.
component whose data matrix is Y , which we assume to be coloured. We
substitute the covariance matrices RX and RY = Y T Y for A and B in
(7.58) respectively. We assume RY is full rank. Then (7.58) becomes
RX v = λRY v
= λGGT v
In this example, we wish to use the EEG to discriminate between healthy brain
activity and that which represents some form of pathological brain activity,
arising from e.g., coma, concussion, epilepsy or others. In this vein, we
Figure 7.7. An illustration of the EEG.
w∗ = arg max_w  (w^T X_H^T X_H w) / (w^T X_P^T X_P w).
to zero we obtain

[ (w^T RP w) 2RH w − (w^T RH w) 2RP w ] / (·) = 0,

where RH = X_H^T X_H ; similarly for RP . The denominator of the above is
not evaluated since it is multiplied by 0 and is therefore irrelevant. Solving
the above, we get

RH w = λ RP w,                                          (7.60)

where λ = (w^T RH w)/(w^T RP w).
Let the QR decomposition of A be expressed as
Q^T A = R = [ R1
              0  ] ,                                    (7.61)

where R1 is n × n upper triangular and the zero block is (m − n) × n.
From our previous discussion, and from the structure of the QR decompo-
sition A = QR, we note that Q1 is an orthonormal basis for R(A), and Q2
is an orthonormal basis for R(A)⊥ . We now define the quantities c and d
as

Q^T b = [ Q1^T b     [ c
          Q2^T b ] =   d ] ,                            (7.62)

where c ∈ R^n and d ∈ R^{m−n}.
It is clear that x does not affect the “lower half” of the above equation. Eq.
(7.63) may be written as
xLS = R1^{−1} c.
orthonormal matrix Q = Q1 Q2 , allowing a complete solution to the
LS problem.
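The QR route to the LS solution described above can be sketched in a few lines of NumPy/SciPy (random data, for illustration only):

import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(4)
m, n = 8, 3
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)

Q, R = np.linalg.qr(A, mode="complete")
R1 = R[:n, :]
c, d = Q[:, :n].T @ b, Q[:, n:].T @ b          # c = Q1^T b, d = Q2^T b, as in (7.62)

x_ls = solve_triangular(R1, c)                 # back-substitution: R1 x = c
print(np.allclose(x_ls, np.linalg.lstsq(A, b, rcond=None)[0]))      # agrees with lstsq
print(np.isclose(np.linalg.norm(d), np.linalg.norm(A @ x_ls - b)))  # residual norm = ||d||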
7.8 Problems
2. On the website you will find a file A4Q5.mat which contains 2 variables
A and B. Each column bi of B is generated according to bi = Axo +
ni , i = 1, . . . m, where m in this case is 1000. For this problem, the
ni are coloured. Write a matlab program to estimate the xLS (i), i =
1, . . . m so that the estimates have the minimum possible variance.
Also jointly estimate the noise covariance matrix Σ.
Hint: This will require an iterative procedure as follows:
(a) Initialize the iteration index k to zero, and the noise covariance
matrix estimate Σ̂o to some value (e.g., I).
(b) Using the current Σ̂k , use the appropriate form of normal equa-
tions to calculate xLS (i), i = 1, . . . , m.
(c) For a more stable estimate, calculate the mean x̄ over all the LS
estimates.
(d) The noise vectors ni can then be estimated using x̄, from which
an updated Σ̂k can be determined.
(e) Increment k and go to (b) until convergence.
Also calculate the initial cov(xLS ) which assumes white noise, and also
the final covariance estimate obtained after convergence. Comment on
the differences.
b = Ax + n (7.65)
where k is an arbitrary constant > 0, and ai is the ith column of A.
Explain how to choose A so that the variance of each element of the
LS estimate xLS of x is minimum.
Hint: use the Hadamard inequality: For a positive definite square
symmetric matrix X ∈ Rn×n ,
det(X) ≤ Π_{i=1}^{n} x_{ii} ,                           (7.66)
with equality iff X is diagonal.
is minimized.
(b) What are the set xi , y i that minimize the minimum in (7.67)?
(c) What constraint is there on k so that the solution is unique?
5. Let A ∈ Rm×n , and b ∈ Rn . Find x so that ||A−xbT ||F is minimized.
Hint: For matrices A and B of compatible dimension, trace(AB) =
trace(BA).
6. Here we look at evaluating the spectrum of the vocal tract for a speech
signal. Using a least–squares approach, determine the frequency re-
sponse H(z) of the vocal tract used to generate the speech sample
found in file SPF2.mat. Use samples 5600:6200 from the speech signal
for your analysis. Experiment with prediction orders between 8 and
12.
7. Assuming the noise is Gaussian and the assumptions A1 and A2 of
Sect. 7.3 hold, calculate a 95% confidence interval for a specified
element of xLS .
8. Given that we are provided a table of values for ti and corresponding
yi in the following equation, devise a method for determining the time
constant τ and the scalar value a:
yi = a exp(−ti /τ ),
where a is a real constant.
9. With regard to the common spatial patterns method, assume RH and
RP are given respectively by

RH = [ 1    0.5        RP = [  1    −0.5
       0.5  1   ] ,           −0.5   1   ] .
What is the optimum weight vector w in this case? What is the ratio
of the variance of yH (t) to that of yP (t) for this choice of w?
(a) Develop this estimator and explain how to estimate the delays τi
when the noise is white.
(b) As above, when the noise has an arbitrary covariance Σ.
(c) Once the τi have been estimated, explain how to estimate the ai .
Chapter 8
In the previous chapters we considered only the case where A is full rank and
tall. In practical problems, this may not always be the case. Here we present
the pseudo–inverse as an effective means of solving the LS problem in the
rank deficient case when the rank r is known. We also show that the pseudo–
inverse, (aka principal component analysis in this context), is also effective
in the near rank deficient case at controlling the large variances of the LS
solution that occur in this situation. Then we discuss an alternative method
of solving the rank–deficient LS problem using the QR decomposition. We
show that the pseudo–inverse is a generalized approach for solving any type
of linear system of equations under specified conditions.
Previously, we have seen that the LS problem determines the xLS which
solves the minimization problem given by
xLS = arg min_x ||Ax − b||2²                            (8.1)
where the observation b is generated from the regression model b = Axo +n.
The solution xLS is the one which gives us the best fit between the linear
model Ax and the observations b. For the case where A is full rank we saw
that the solution xLS which solves (8.1) is given by the normal equations
AT Ax = AT b. (8.2)
We have seen previously in Sect. 7.4, that even one small eigenvalue of the
matrix AT A destroys the desirable variance properties of the LS estimate
and introduces the potential for all elements of xLS to have large variance.
One small eigenvalue of AT A implies the matrix is poorly conditioned and
close to rank deficiency. The pseudo–inverse is a means of remedying this
adverse situation and can be very effective in reducing the error in the LS
solution.
where A+ is defined by (8.3). Further, the squared norm ρ2LS of the LS
residual r LS is given as
ρ²LS = Σ_{i=r+1}^{m} (u_i^T b)².                        (8.6)
where

w = [ w1     [ V1^T
      w2 ] =   V2^T ] x = V^T x,                        (8.9)

with w1 ∈ R^r and w2 ∈ R^{n−r}, and

[ c1     [ U1^T
  c2 ] =   U2^T ] b = U^T b,                            (8.10)

with c1 ∈ R^r and c2 ∈ R^{m−r}, and

Σr = diag[σ1 , . . . , σr ].
Note that we can write the quantity ||Ax − b||22 in the form of (8.7), since
the 2-norm is invariant to the orthonormal transformation U T , and the
quantity V V T which is inserted between A and x is identical to I.
1. Because of the zero blocks in the right column of the matrix in (8.8),
we see that the solution w is independent of w2 . Therefore, w2 is
arbitrary.
2. Note that for any vector y = [y 1 y 2 ]T , ||y||22 = ||y 1 ||22 + ||y 2 ||22 . Since
the argument of the left–hand side of (8.8) is a vector, it may therefore
be expressed as
Therefore, (8.8) is minimized by choosing w1 to satisfy
Σr w1 = c1 .
4. where the inverse exists because Σr consists only of the non-zero sin-
gular values. Combining our definitions for w1 and w2 together, we
have

w = [ w1     [ Σr^{−1}  0
      0  ] =   0        0 ] c
  = Σ^+ c.                                              (8.12)

It then follows that V^T xLS = Σ^+ U^T b, or

xLS = V Σ^+ U^T b = A^+ b.                              (8.13)

The squared norm of the LS residual is then ρ²LS = Σ_{i=r+1}^{m} (u_i^T b)², as
stated in (8.6).
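The rank–r pseudo–inverse solution (8.13) is easy to sketch numerically; the helper pinv_r below and the synthetic matrix with two tiny singular values are illustrative assumptions:

import numpy as np

def pinv_r(A, r):
    """Rank-r pseudo-inverse A+ = V Sigma_r^+ U^T (small singular values truncated)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.zeros_like(s)
    s_inv[:r] = 1.0 / s[:r]
    return Vt.T @ np.diag(s_inv) @ U.T

rng = np.random.default_rng(5)
U0, _ = np.linalg.qr(rng.standard_normal((8, 4)))
V0, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = U0 @ np.diag([3.0, 1.0, 1e-8, 1e-9]) @ V0.T   # two singular values essentially zero
b = rng.standard_normal(8)

print(np.linalg.norm(pinv_r(A, 2) @ b))   # rank-2 (truncated) solution is well behaved
print(np.linalg.norm(pinv_r(A, 4) @ b))   # inverting the tiny singular values blows up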
8.2 Interpretation of the Pseudo-Inverse
But, for the specific case where m > n, we know from our previous discussion
on linear least squares, that
AxLS = P b (8.15)
where P is the projector onto R(A). Comparing (8.14) and (8.15), and
noting the projector is unique, we have
P = AA+ . (8.16)
This may also be seen in a different way as follows: Using the definition of
A+ , we have
AA^+ = U Σ V^T V Σ^+ U^T
     = U [ I_r  0
           0    0 ] U^T
     = U_r U_r^T .                                      (8.17)
We also note that it is just as easy to show that for the case m < n, the
matrix A+ A is a projector onto the row space of A.
8.2.2 Relationship of the Pseudo-Inverse Solution to the Normal Equations
In the full-rank case, these two quantities must be equal. We can indeed
show this is the case, as follows: We let
AT A = V Σ2 V T
be the ED of AT A and we let the SVD of AT be defined as
AT = V ΣU T .
8.3 Principal Component Analysis (PCA)
xP C = V Σ+ U T b, (8.19)
where

Σ^+ = diag( 1/σ1 , 1/σ2 , . . . , 1/σr , 0, . . . , 0 ),
where σr+1 . . . , σn are assumed small enough to cause trouble and are there-
fore truncated. In practice, the value of r is usually determined empirically
by trial–and–error methods, cross–validation, or through the use of some
form of prior knowledge.
The only difficulty with this principal component approach is that it in-
troduces a bias in xP C , whereas we have seen previously that the ordinary
normal equation xLS is unbiased. To see this biasedness, we let the singular
value decomposition of A be expressed as A = U ΣV T , and write
xP C = A^+ b
     = V Σr^+ U^T (Axo + n).                            (8.20)

Thus, because the noise has zero mean, the expected value of xP C may be
expressed as

E(xP C ) = V Σr^+ U^T (Axo )                            (8.21)
         = V Σr^+ U^T (U Σ V^T xo )
         = V Σr^+ Σ V^T xo
         = V [ I_r  0
               0    0 ] V^T xo                          (8.22)
         ≠ xo .
Substituting (8.21) for E(xP C ), using (8.20) for xP C , and assuming that
E(nnT ) = σ 2 I, we get
cov(xP C ) = E[ V Σr^+ U^T n n^T U Σr^+ V^T ]
           = σ² V Σr^+ U^T I U Σr^+ V^T
           = σ² V (Σr^+)² V^T .                         (8.24)
This expression for covariance is similar to that for xLS , except that it
excludes the inverses of the smallest singular values and the corresponding
directions which have large variation. Thus, the elements of cov(xP C ) can
be significantly smaller than those for xLS , as desired.
Thus, we see that principal component analysis (PCA) is a tradeoff between
reduced variance on the one hand, and increased bias on the other. The
objective of any estimation problem is to reduce the overall error, which
is a combination of both bias and variance, to a minimum. In fact, it is
readily verified that the mean–squared error E(x̂ − xo )2 of an estimate x̂ of
a quantity whose true value is xo is given by
E(x̂ − xo )2 = b2 + σx2 ,
where b is the bias and σx2 is the variance of the estimate. If A is poorly
enough conditioned, then the improvement in the variance of xP C over that
of xLS is large, and the bias introduced is small, so the overall effect of PCA
is positive. However, as A becomes better conditioned, then the two effects
tend to balance each other off, and the technique becomes less favourable.
The choice of the parameter r controls the tradeoff between bias and vari-
ance. The smaller the value of r, the fewer the number of components in
A+ ; hence, the lower the variance and the higher the bias.
Figure 8.1. Scatter plots for the simulation example using both the normal equation solu-
tion and the principal component solution when A is poorly conditioned. The shrinkage
in the variation for the PC case is strongly evident.
The result for the normal equation case is

cov(xLS ) = [  2.0915   1.8141  −4.0824
               1.8141   1.5880  −3.5445
              −4.0824  −3.5445   7.9695 ] ,

whereas that for the pseudo–inverse case is

cov(xP C ) = [  0.0011  −0.0026  −0.0012
               −0.0026   0.0093   0.0023
               −0.0012   0.0023   0.0016 ] .

The means from the simulation for xLS and xP C are given respectively by
[1.0126, 1.0102, 0.9757]^T and [1.0145, 1.0119, 0.9720]^T ,
which are both close to the true values, as expected. Thus, we see that the
pseudo-inverse technique has significantly improved the variance in this case
when A is poorly conditioned. We also see that the error in the ordinary
LS estimate of the means is approximately equivalent to that of the PC
estimate. Thus it appears that in this example, the bias in the PC estimate
may be considered negligible, especially in view of the significant reduction
in variance.
We construct an example to show this is not always true. Suppose the rank
2 matrix A is defined as follows:
A = [ −0.4437   0.1500  −0.4119
       0.4836  −0.1635  −1.5977
       0.6345  −0.2146   0.4580
      −0.2555   0.0864   0.5244 ] .
A = QR
  = [ −0.468   0.849   0.047   0.239     [ 0.948  −0.321  −0.457
       0.510   0.237  −0.767   0.308       0       0      −0.822
       0.669   0.253   0.638   0.286       0       0       1.524
      −0.269  −0.398   0.048   0.875 ]     0       0       0     ] .        (8.26)
Because of the zero in the R(2, 2) position, we see that R(A) ≠ span[q 1 q 2 ]
as desired. Further, this QR decomposition is of no value in solving the LS
problem, because R is not full rank. The problem in (8.26) is that there are
no r columns (in this case 2 columns) of Q that can act as an orthonormal
basis for R(A).
where R11 ∈ Rr×r is upper triangular and non-singular and R12 is a rect-
angular matrix. In this case it is clear that that the rank-deficient QR
decomposition in the form of (8.27) indeed satisfies (8.25), where R(A) =
span[q 1 . . . , q r ]. The permutation matrix Π is determined in such a way
that at each stage i, i = 1, . . . , r, the diagonal elements rii of R11
are as large in magnitude as possible, thus avoiding the degenerate form of
(8.26). But what is the procedure to determine Π?
To answer this, consider the ith stage, i < r of the QR decomposition with
column pivoting. Here, the first i columns have been annihilated below
the main diagonal by an appropriate QR decomposition procedure, such as
Householder. There exist an orthonormal Q and a permutation matrix Π
so that

Q^T A Π = [ R11  R12
            0    R22 ] ,                                (8.28)

where R11 ∈ R^{i×i}, R12 ∈ R^{i×(n−i)} and R22 ∈ R^{(m−i)×(n−i)}. At the next
stage, the orthonormal transformation applied has the form

Q^{(i+1)} = [ I_i  0
              0    Q̃ ] ,                               (8.29)
where the Q̃ above is the matrix which eliminates the desired elements
of r 22 (1). Since Q̃ is orthonormal, the element r(i + 1, i + 1) in the top
left position of R22 after the multiplication is complete is therefore equal to
||r 22 (1)||2 . It is then clear that to place the elements with the largest possible
magnitudes along the diagonal of R, we must choose the permutation matrix
Π(i+1) at the (i + 1)th stage so that the column of R22 in (8.28) with
maximum 2-norm is swapped into the lead column position of R22 . This
procedure ensures that the resulting QR decomposition will have the form
of (8.27) as desired. Effectively, this procedure ensures that no zeros are
introduced along the diagonal of R until after stage r.
We can write (8.28) at the completion of the ith stage in the form

[ Ã1  Ã2 ] = [ Q1  Q2 ] [ R11  R12
                          0    R22 ]                    (8.30)

(with Ã1 , Q1 ∈ R^{m×i}, Ã2 ∈ R^{m×(n−i)} and Q2 ∈ R^{m×(m−i)}),
where the tilde over the A– blocks indicates that the columns of A have
been permutated as prescribed by Π(i) . From our previous discussions,
Q1 is an orthonormal basis for R(A1 ) and Q2 is an orthonormal basis for
R(A1 )⊥ . It follows directly from the block multiplication in (8.30) that
the elements of the column r 22 (k) of R22 are the coefficients of the column
ã2 (k) in the basis Q2 . Thus, the column r 22 (1) which is annihilated after
the permutation step at the (i + 1)th stage corresponds to the column of Ã2
which has the largest component in R(Ã1 )⊥ .
Note that in the rank deficient case, there is no unique solution for (8.31).
Hence, unless an extra constraint is imposed on x, the LS solution obtained
by a particular algorithm can wander throughout the set of possible solu-
tions, and very large variances can result. As in the pseudo- inverse case,
the constraint of minimum norm is a convenient one to apply in this case, in
order to specify a unique solution. However, unlike the development of the
pseudo-inverse solution, we will see that the direct use of the QR decompo-
sition does not lead directly to the minimum norm solution xLS . However,
it is still possible to derive an elegant solution to the LS problem using only
the QR decomposition procedure. We now discuss how this is achieved.
Let
y r
ΠT x = . (8.32)
z n−r
The idea is to eliminate R12 ; then finding the xLS with minimum norm is
straightforward. There exists an orthonormal Z ∈ Rn×n such that
\begin{bmatrix} R_{11} & R_{12} \\ 0 & 0 \end{bmatrix} Z = \begin{bmatrix} T_{11} & 0 \\ 0 & 0 \end{bmatrix}   (8.34)

where T_{11} is nonsingular and upper triangular of dimension r × r. Therefore,

Q^T A \Pi Z = \begin{bmatrix} T_{11} & 0 \\ 0 & 0 \end{bmatrix}.   (8.35)
Eq. (8.35) is called the complete orthogonal decomposition of the matrix A.
8.6 Problems
1. We have seen that the pseudo-inverse is very effective in solving least
squares problems when the X–matrix is poorly conditioned. Another
method of dealing with poorly–conditioned systems is regularization,
which we discuss in Chapter 10. Yet another method is to use principal
component analysis (PCA). With this approach, we replace X ∈ Rm×n
with a rank-r approximation X r , defined as
X Tr = V r C, (8.38)
Chapter 9
In this chapter, we discuss various practical issues that arise when solving
real problems using LS methods and in particular, the more recent latent
variable (LV) methods. The primary objective of model building we consider
in this chapter is the prediction of response values ŷ T corresponding to
new values xTN of our observations. We investigate three types of latent
variable methods, which are PCA (revisited), partial least squares (PLS)
and canonical correlation analysis (CCA).
Y = Xβ + E, (9.1)
where β ∈ Rn×k is the new notation for x and E is the error matrix, which is
the same size as Y . Eq. (9.1) is again referred to as the regression equation,
where we adopt the terminology that Y is regressed onto X through (9.1).
The notation we adopt here is standard in the statistical literature, where
latent variable methods are prevalent. On the other hand, the previous
notation of Chapters 7 and 8 is the most commonly used in the algebraic
literature.
Latent variable methods are founded on the idea that X (and often Y )
are expressed in a basis of dimension r, which is typically small relative
to n or k. Doing so alleviates the conditioning problem which, as we have seen previously, results in large variances of the parameter estimates. “Latent” in this context implies “unseen”, and the latent variables are basis vectors which are used to represent X and/or Y . In the PCA
case, the latent variables are the principal eigenvectors of X T X; however,
for the partial least squares (PLS) and canonical correlation analysis (CCA)
methods which we discuss later, the latent variables are derived differently.
Since r is small relative to n or k, the latent variable basis is referred to
as being “incomplete”; i.e., X or Y cannot be represented in the LV basis
without error. However, by careful choice of the LV basis, this error can be
controlled while at the same time the error in the prediction of Y values
corresponding to new values of X can be considerably reduced.
The ith row of X is the ith observation of a set of n variables, and the
jth column contains the values of the jth variable over the set of m obser-
vations. For example, in a chemical reactor environment, each row of X
corresponds to a set of controllable (independent) inputs such as tempera-
ture, pressure, flow rates, etc. Each corresponding row of Y represents the
corresponding response values (outputs, or dependent variables) from the
reactor; i.e., output parameters containing concentrations of desired prod-
ucts, etc. Each row represents one of m different settings of the various
inputs and corresponding outputs.
need an accompanying test set, which is an independent set of X and Y
values used solely for evaluating the performance of the model. The model
must be trained using only the training set, and then evaluated using only
the test set. If the test set is included in the training procedure, then the
model will be trained using data it will be tested on, with the result that the estimated performance is inflated. The evaluation of the performance of a model is a process which must be undertaken with care, and involves the implementation of a cross-validation process [25, 26]. Cross-validation is not
discussed in this volume, but is well–described in the references.
9.1 Design of X
n, then adding a column to X adds an additional variable to the LS problem
and increases X T X to size (n + 1) × (n + 1). From the Interlacing Property,
we have that λ_1(n + 1) ≥ λ_1(n) and λ_{n+1}(n + 1) ≤ λ_n(n), where the number in round brackets indicates the size of the matrix the respective λ is associated with. Therefore the condition number (i.e., λ_1/λ_n) never improves when
a column is added to X. In fact, the equality only holds under special
conditions, so in the general case, the condition number of X T X increases
by adding a column, and hence the variances of the LS estimates will degrade
under these circumstances. This behaviour is a manifestation of a general
principle in estimation theory that as the number of parameters estimated
from a given quantity of data increases, the variances of the estimates also
increase. This principle applies in particular to least–squares estimation.
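This behaviour is easy to observe numerically. In the sketch below (the dimensions are illustrative), a column is appended to X and the condition number of X^T X is compared before and after; in agreement with the Interlacing Property, it never improves.

import numpy as np

rng = np.random.default_rng(1)
m, n = 100, 6
X = rng.standard_normal((m, n))
cond_before = np.linalg.cond(X.T @ X)

# Append a new column that is partially correlated with the existing columns.
new_col = X @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)
X_aug = np.column_stack([X, new_col])
cond_after = np.linalg.cond(X_aug.T @ X_aug)

print(cond_before, cond_after)   # cond_after >= cond_before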
There are two methods commonly employed for controlling the number of
variables. “Variables” are also referred to as “features” in the machine learn-
ing context. These methods are feature selection and feature extraction re-
spectively. With variable selection, variables are selected to be included in X
based on their statistical dependency with the responses of Y . For example,
the minimum redundancy maximum relevance (mRMR) method [27] selects
features iteratively. On the first iteration, the feature with the strongest
statistical dependence on Y is chosen. Then in subsequent iterations, the feature with the best combination of maximum statistical dependence with Y (relevance) and minimum statistical dependence (redundancy) with the
features chosen in previous iterations, is chosen. The process repeats un-
til the number of prescribed features is selected. This process produces a
set of features that are maximally predictive of the response variable and
as mutually independent as possible, meaning that the columns of X are
“discouraged” from being linearly dependent, thus resulting in a favourable
condition number.
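A minimal sketch of this kind of greedy selection follows. Note that mRMR itself measures statistical dependence using mutual information; the absolute sample correlation is used below purely as a simple stand-in, to illustrate the relevance-minus-redundancy bookkeeping.

import numpy as np

def greedy_select(X, y, num_features):
    """Greedy relevance-minus-redundancy feature selection (correlation-based sketch)."""
    m, n = X.shape
    rel = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n)])   # relevance to y
    C = np.abs(np.corrcoef(X, rowvar=False))                          # redundancy between features

    selected = [int(np.argmax(rel))]                 # start with the most relevant feature
    while len(selected) < num_features:
        remaining = [j for j in range(n) if j not in selected]
        # score = relevance minus average redundancy with the already-selected features
        scores = [rel[j] - C[j, selected].mean() for j in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected

# Example: y depends strongly on columns 0 and 3 of X; these are selected first.
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 8))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.standard_normal(200)
print(greedy_select(X, y, 3))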
The second method for controlling the number of variables is feature extrac-
tion, which is equivalent to the latent variable methods as discussed in this
section. Here, a prescribed number r of latent variables, each of which is
some form of optimal linear combination of all the available variables, are
calculated from the data. With this approach, the irrelevant variables would
be given small weights and contribute little to the latent variables. Thus
with the feature extraction method, all the variables are optimally combined
into a set of r latent vectors.
The choice of m, which is the number of observations, is more straightfor-
ward than choosing n. Generally speaking, the more observations the better,
so it is desirable to choose m as large as possible. The larger the value of
m, the larger the elements of X T X become, and consequently the smaller
are the resulting variances (see (7.34)). In fact, the variances decrease as
1/m. However, in most applications, collecting data is an expensive and
time consuming proposition, and so often we must make do with whatever
quantity of data is available. In the following, our prime motivation with regard to LV methods is to predict responses ŷ^T from new data samples x_N^T.
We have already developed the PCA approach in Section 8.3 from the per-
spective of the pseudo–inverse, and also in Ch. 2 from the perspective of
data compression and denoising signals. Here we present PCA in the latent
variable context, which, as stated directly above, has to do with prediction.
The quantity T r is the latent variable representation of X, since by design,
projections of X along the vectors ti have maximum variance. Since we
assume there is a linear relationship between X and Y (otherwise Y cannot
be predicted from X using the methods discussed here), then it follows there
must also be a linear relationship between Y and T_r, which can be expressed in the form of a regression equation as follows:

Y = T_r β_T + E,   (9.4)

for which the LS solution is

β_T = (T_r^T T_r)^{-1} T_r^T Y.   (9.5)
The determination of β T constitutes the training process for the PCA method.
ŷ^T = t_r^T β_T.

Note that if we define the quantity X_r ≜ T_r V_r^T = X V_r V_r^T, which is the projection of the row space of X onto the latent variable subspace, then according to Property 12 of Chapter 2, there is no other r-dimensional basis
for which the quantity ||X − X r ||2 is smaller. This is the motivation for
choosing the eigenvectors as the latent variables.
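A minimal numerical sketch of this training and prediction procedure: the principal directions V_r are obtained from the SVD of X, the latent scores T_r = X V_r are formed (consistent with X_r = T_r V_r^T above), β_T is computed as in (9.5), and a new observation x_N^T is mapped into the latent space as x_N^T V_r before applying β_T. This last mapping is an assumption consistent with T_r = X V_r, and the data are purely illustrative.

import numpy as np

rng = np.random.default_rng(3)
m, n, k, r = 200, 8, 2, 3
X = rng.standard_normal((m, n))
Y = X @ rng.standard_normal((n, k)) + 0.1 * rng.standard_normal((m, k))

# Principal directions of X: the leading right singular vectors (eigenvectors of X^T X).
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_r = Vt[:r].T                                      # n x r

# Training, as in (9.4)-(9.5): regress Y onto the latent scores T_r.
T_r = X @ V_r                                       # m x r latent representation of X
beta_T = np.linalg.lstsq(T_r, Y, rcond=None)[0]     # solves (T_r^T T_r) beta_T = T_r^T Y

# Prediction for a new observation x_N: map into the latent space, then into the Y-space.
x_N = rng.standard_normal(n)
y_hat = (x_N @ V_r) @ beta_T
print(y_hat)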
of R(X). On the other hand, for the present PCA approach, X and Y are
related through (9.4), where we have replaced X with T r , which is valid since
T_r is a principal basis for R(X) through (9.3). The quantity Ŷ in this case is given through (9.4) and (9.5) as Ŷ = T_r β_T = T_r(T_r^T T_r)^{-1} T_r^T Y. The quantity T_r(T_r^T T_r)^{-1} T_r^T is also a projector onto the PS of R(X). We
have seen in Sect. 3.2 that the projector is unique, regardless of its formu-
lation. Thus, Ŷ in the PCA case is also the projection of Y onto the PS of
R(X) and so the pseudo–inverse and PCA both give identical predictions.
The PCA latent variables are determined solely from X and are independent
of the Y variables, and capture the directions of major variation in X
only. For the PLS and CCA methods on the other hand, we form a set of
latent variables, one in the X–space and another in the Y –space that have
maximum covariance in the PLS case, or correlation in the CCA case. By
forming latent variable sets in this manner, which incorporates both datasets,
we expect that the PLS and CCA methods might be better at predicting Y
corresponding to a new set of X values.
A v_i = σ_i u_i   (9.7)

A^T u_i = σ_i v_i.   (9.8)
r_{xy} = \frac{1}{m} x^T y,   (9.9)
whereas the sample correlation estimate ρ_{xy} is given as

ρ_{xy} = \frac{x^T y}{||x||_2 ||y||_2},   (9.10)
From the definition of the inner product we may also write

x^T y = ||x||_2 ||y||_2 \cos θ,   (9.11)

where θ is the angle between the two vectors. Thus from (9.9) we note that the covariance depends on ||x||_2, ||y||_2 and θ. Comparing (9.11) with (9.10), we have

ρ_{xy} = \cos(θ),

and so ρ_{xy}, unlike r_{xy}, depends only on the angle between the vectors and is independent of the norms. Thus ρ_{xy} lies in the range −1 ≤ ρ_{xy} ≤ 1 and gives an idea of how closely the random variables X and Y agree with each other on average.
RXY = X T Y . (9.12)
With the PLS and CCA methods, we create a set of r orthogonal basis
vectors for each of the X and Y datasets. We refer to the respective
subspaces formed by these bases as S_X and S_Y. Each is of dimension r ≤ min(n, k). The latent variable basis vectors for X and Y are denoted t_i and p_i, i = 1, . . . , r respectively. For the PLS case, we choose t_1 ∈ S_X and p_1 ∈ S_Y so that their covariance, i.e., the quantity t_1^T p_1, is maximum. Then t_2 and p_2 are chosen so that they too have maximum covariance, under the constraint that they are each orthogonal to their counterparts of the first set.
The remaining basis vectors are found in a similar manner. The CCA case
is similar, except we choose to maximize correlations instead of covariances.
By choosing the latent variables in this manner, we provide the best possible
fit between the X and Y subspaces and therefore new values xN of X are
more likely to lead to “good” predictions of the corresponding Y –values.
For the time being, we consider only the covariance or PLS case – the CCA
case is addressed later. To determine t ∈ SX and p ∈ SY with maximum
covariance, we identify unit–norm vectors s ∈ Rn and q ∈ Rk , such that the
covariance between t = Xs and p = Y q is maximum. Posing the problem
in this manner guarantees the solutions t∗ and p∗ belong to their respective
subspaces. This problem may be expressed in the form of the following constrained optimization problem:

(s^∗, q^∗) = \arg\max_{s, q} \; s^T R_{xy} q \quad \text{subject to} \quad ||s||_2 = ||q||_2 = 1,   (9.14)

with the corresponding Lagrangian

L(s, q) = s^T R_{xy} q + γ_{i,1}\left(1 − (s^T s)^{1/2}\right) + γ_{i,2}\left(1 − (q^T q)^{1/2}\right).   (9.15)
We differentiate (9.15) with respect to s and q. With regard to the first term,
using a procedure similar to that outlined in Sect. 2.8, it is straightforward
to show that
\frac{d}{dq}\, s^T R_{xy} q = s^T R_{xy}

\frac{d}{ds}\, s^T R_{xy} q = R_{xy} q.
Differentiation of the second term of (9.15) with respect to s is straightfor-
ward using the chain rule. It is readily verified that
\frac{d}{ds}\left[ γ_{i,1}\left(1 − (s^T s)^{1/2}\right) \right] = −γ_{i,1} \frac{\bar{s}}{||\bar{s}||_2},
where s̄ is a vector with the same direction as s but whose 2-norm is arbitrary. Therefore we assign the vector s = \bar{s}/||\bar{s}||_2 as the normalized version of s̄ having unit 2-norm. A corresponding result holds for the last term:

\frac{d}{dq}\left[ γ_{i,2}\left(1 − (q^T q)^{1/2}\right) \right] = −γ_{i,2} \frac{\bar{q}}{||\bar{q}||_2}.

We also define q = \bar{q}/||\bar{q}||_2, which is the normalized version of q̄.
Setting the derivatives of (9.15) to zero and combining the above results yields

R_{xy} q = γ_{i,1} s   (9.16)

R_{xy}^T s = γ_{i,2} q,   (9.17)

respectively, where we have transposed both sides of the second line above.
Comparing (9.16) and (9.17) to (9.7) and (9.8), the latter of which are the
defining relations for the SVD, we see that the stationary points of (9.14)
are respectively the left and right singular vectors of R_xy. Let the SVD of R_xy be expressed as R_xy = U Σ V^T. Therefore the optimal set satisfying (9.14) consists of the first r left and right singular vectors U_r = [u_1 . . . u_r] and V_r = [v_1 . . . v_r] respectively. Note that the required orthogonality property of the solutions follows directly from the orthonormality of U and V.
T_r = [t_1, . . . , t_r] = X U_r,   (9.18)

and

P_r = [p_1, . . . , p_r] = Y V_r.   (9.19)

Note that T_r and P_r are both m × r. The T_r and P_r are the desired latent variable bases for S_X and S_Y respectively.
t_i^T p_i = u_i^T X^T Y v_i = u_i^T R_{xy} v_i = u_i^T U Σ V^T v_i = σ_i.   (9.20)
It is seen that the r maximum covariances between X and Y are the largest
r singular values of Rxy , which are the σi . The directions in SX and SY
which result in this largest covariance are given by ti = Xui and pi = Y v i
respectively.
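The construction in (9.18)–(9.20) translates almost line for line into code. The sketch below (illustrative data, with R_xy taken as X^T Y without the 1/m scaling, as in (9.12)) verifies numerically that t_i^T p_i equals the ith singular value of R_xy.

import numpy as np

rng = np.random.default_rng(4)
m, n, k, r = 300, 6, 4, 2
X = rng.standard_normal((m, n))
Y = X @ rng.standard_normal((n, k)) + 0.2 * rng.standard_normal((m, k))

Rxy = X.T @ Y                                   # as in (9.12)
U, sigma, Vt = np.linalg.svd(Rxy)               # Rxy = U Sigma V^T
U_r, V_r = U[:, :r], Vt[:r].T

T_r = X @ U_r                                   # (9.18): latent variable basis for S_X
P_r = Y @ V_r                                   # (9.19): latent variable basis for S_Y

for i in range(r):                              # (9.20): t_i^T p_i = sigma_i
    print(T_r[:, i] @ P_r[:, i], sigma[i])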
The development for the CCA case is identical to that of the PLS case,
except that we use the variables X̃ and Ỹ in place of X and Y throughout
the development above. Because PLS is related to the covariance between X and Y, the PLS latent variables are formed from a combination of the
directions of major variation in both X and Y , as well as the angles between
the latent vectors in the two respective subspaces. Because CCA is derived
from the correlation between the variables, the CCA latent variables are de-
termined solely from the angles between the subspaces and are independent
of the directions of major variation. This is a direct consequence of the fact
the columns of both X̃ and Ỹ are orthonormal. In the CCA case only, it
can be shown [1] that the σi , i = 1, . . . , r are the cosines of the r angles
specifying the relative orientations between SX and SY . In this vein, it may
be shown (Problem 3) that 0 ≤ σi ≤ 1.
The steps involved in the evaluation of the latent variables for the PLS or
CCA methods can be summarized as follows:
Calculate the SVD of the R–matrix above to give the values U r and
V r.
The presentation here for identifying the PLS and CCA latent variables is
quite different from the usual treatment in the literature. Most methods
e.g.[20, 22] use the nonlinear iterative partial least squares (NIPALS) algo-
rithm for extracting the latent variables. However, the method presented
here using the SVD on the matrix Rxy as in (9.14) yields identical LVs, and
affords a simpler presentation.
We now consider the problem where the latent variables for either one of the
two methods discussed are available, and we wish to determine the row(s)
of prediction estimates ŷ T corresponding to a new observation (row(s)) xTN
of X–values. The prediction procedure for both the PLS and CCA methods is identical given the set of their respective latent variables.
β_y = (P_r^T P_r)^{-1} P_r^T Y.   (9.23)
Once the latent variables have been computed as described in the previous section, we compute predicted values ŷ^T for Y corresponding to new values x_N^T of X in the following manner (a code sketch follows these steps):
Given β P from (9.21), transform tTN from SX into SY to give pTN using
(9.22), representing the predicted values of Y in the latent subspace.
Transform the pTN from the latent space into the Y –space using (9.24),
where β y is given by (9.23), to yield the predicted values ŷ T of Y .
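A sketch of the complete training-and-prediction pipeline is given below. The regression β_y follows (9.23); the forms assumed for β_P (the S_X → S_Y map of (9.21)–(9.22)) and for the latent mapping t_N^T = x_N^T U_r are plausible reconstructions, since those equations are not reproduced on this page, and the data are illustrative.

import numpy as np

rng = np.random.default_rng(5)
m, n, k, r = 300, 6, 4, 2
X = rng.standard_normal((m, n))
Y = X @ rng.standard_normal((n, k)) + 0.2 * rng.standard_normal((m, k))

# PLS latent variable bases, as in (9.18)-(9.19).
U, _, Vt = np.linalg.svd(X.T @ Y)
U_r, V_r = U[:, :r], Vt[:r].T
T_r, P_r = X @ U_r, Y @ V_r

# Training regressions (beta_y as in (9.23); beta_P is the assumed S_X -> S_Y regression).
beta_P = np.linalg.lstsq(T_r, P_r, rcond=None)[0]
beta_y = np.linalg.lstsq(P_r, Y, rcond=None)[0]

# Prediction for a new observation x_N.
x_N = rng.standard_normal(n)
t_N = x_N @ U_r          # latent representation of x_N (assumed mapping)
p_N = t_N @ beta_P       # predicted latent values in S_Y
y_hat = p_N @ beta_y     # predicted responses
print(y_hat)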
A last topic for this section is to introduce the terminology “loadings” and “scores”, which is in common use in the statistical literature with respect to latent variables. The variables X and Y are typically represented in their
rank-r latent variable bases as
X = T P T + Ex
Y = U QT + E y .
The matrices T and U are bases for the column spaces of X and Y re-
spectively, whereas P and Q are the corresponding row space bases. The
matrices T and U are referred to as “scores”, whereas P and Q are referred
to as “loadings”.
With the present simulation scenario, the singular values of X are typically in the range between 1 and 10. To introduce near linear dependence among the columns of X and corresponding poor conditioning (which is necessary to illustrate the effectiveness of LV methods), we perform an SVD on X, and replace the three smallest singular values of Σ with the value 1 × 10^{−8}.
A new matrix X is then reassembled from its SVD components using the
modified version of Σ.
Y = XB + σE (9.25)
where
E ∈ Rm×k is additive Gaussian noise, also with zero mean and unit
variance.
Then after 100 iterations of the inner loop, a new iteration of the outer loop
proceeds, where new values of X and B are calculated and 100 iterations
of the inner loop are repeated for these new values of X and B. After 50
iterations of the outer loop, all the normalized prediction errors are averaged
together. In this manner, the final results reflect LV performance over many settings of the X and B values. We show the normalized prediction errors
vs. r (the number of latent components) for SNR = 6 and 20 dB in the
following figures, for each of the three LV methods discussed. Results from
the ordinary normal equations are not shown since their accuracies are very
poor, due to the poor conditioning of X.
The nominal rank for X is 5, since there are 8 columns but the three smallest
singular values were set to very small values, thus making the effective rank
equal to 5. It may be seen that in all cases (except perhaps CCA at 6 dB
SNR) the prediction error drops to a plateau, whose value depends mostly
on the SNR. In this case, the dimensionality of the latent variable subspaces
is high enough to form an accurate model, and this prediction error is low.
Below r = 5, the prediction errors rise sharply due to underfitting. In these
cases, PLS uniformly performs better than PCA as expected, since PLS is
inherently a more flexible model. The CCA performance is approximately
comparable to that of PLS, except CCA performance drops off for low values
of SNR. It is apparent that the prediction performance of all three methods
is highly dependent on the SNR value.
Figure 9.1. An overview of the simulation process used to compare the three forms of LV
methods.
[Plots of the average relative error in prediction vs. r, the number of latent components, for the PCA, PLS and CCA methods, at SNR = 6 dB (top) and SNR = 20 dB (bottom).]
9.4 Problems
Chapter 10
Regularization
The methods we have discussed until now improve stability of the model
by purposely reducing the effective rank of X and/or Y. An alternative
approach is regularization, which improves the modelling process by incor-
porating prior information into the model in some form. In the LS example, regularization helps to mitigate the effects of a poorly conditioned X^T X, which, as the reader may recall, leads to large variances in the LS estimates, as we have seen in previous chapters.
LV methods are one form of regularization, since they impose prior infor-
mation by assuming low rank approximations to X and Y. All forms exist to mitigate the effects of poor conditioning, which results when the columns of X become nearly linearly dependent, producing near rank deficiency and hence at least one small eigenvalue. This implies that the variables cor-
responding to each column are too dependent on one another, or in other
words, there is not enough joint information in the columns/variables to
create a stable model. Regularization imposes additional prior information
on the solution to help mitigate this situation.
It may be shown [25] that there is a corresponding value of t in (10.2) for which the solutions are identical.
Thus, the ridge regression method effectively adds the value λ to the diagonal elements of X^T X. Recall from the Properties of Eigenvalues in Chapter 2 that adding a constant term λ to the diagonal elements of a matrix has the effect of adding the same value to each of its eigenvalues; i.e., each λ_i is replaced by λ_i + λ.¹ Further recall the discussion on the condition number K_2(A) of a matrix A in Ch. 4. When solving a system of equations Ax = b,
the quantity K2 (A) is the worst–case magnification factor by which errors
in A or b appear in the solution x. K_2(A) in the 2-norm sense is given by

K_2(A) = \frac{|λ_1|}{|λ_n|},
i.e., the ratio of the absolute values of largest to smallest eigenvalues of
A. After regularization, the modified condition number K_2'(A) therefore becomes

K_2'(A) = \frac{|λ_1 + λ|}{|λ_n + λ|}.
In a poorly conditioned LS problem, λ_1 ≫ λ_n, and so if λ is significantly greater than λ_n, K_2'(X) can be significantly less than K_2(X) without significantly perturbing the matrix X^T X, and therefore also the integrity of the solution.
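The improvement in the condition number is easy to verify numerically; in the sketch below (an illustrative matrix with two nearly dependent columns), adding λ to the eigenvalues leaves the large one essentially unchanged while lifting the small one.

import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((100, 6))
X[:, -1] = X[:, 0] + 1e-4 * rng.standard_normal(100)   # nearly dependent columns

eig = np.linalg.eigvalsh(X.T @ X)          # eigenvalues in ascending order
lam = 1.0
print(eig[-1] / eig[0])                    # K2(X^T X): very large
print((eig[-1] + lam) / (eig[0] + lam))    # K2' after adding lambda: orders of magnitude smaller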
Xβ n = X(X T X)−1 X T y,
¹ A clarification on notation: λ_i (with a subscript) denotes an eigenvalue, whereas λ without a subscript denotes the ridge regression regularization parameter.
where β_n is the normal equation estimate of β. When we substitute the SVD of X = U Σ V^T into the above, we obtain the simplified form

Xβ_n = U U^T y,

which, when we apply the outer product rule for matrix multiplication, may be expressed as

Xβ_n = \sum_{i=1}^{n} u_i u_i^T y.   (10.4)
It is interesting to note that the PCA solution β pca can also be expressed in
the form
Xβ_pca = U_r U_r^T y,

where U_r = [u_1, . . . , u_r]. Using the outer–product rule for matrix multiplication, this can be written in the form

Xβ_pca = \sum_{i=1}^{r} u_i u_i^T y.   (10.5)
Thus, by comparing (10.4) and (10.5), we see that the PCA solution is
similar in form to the normal equation solution, but PCA applies a hard
thresholding procedure to eliminate the components [ur+1 . . . un ] which are
associated with the smaller singular values of X.
Now we look at the ridge regression solution in the light of (10.4) and (10.5). From the ridge regression estimate (10.3), we get

X β̂^{rr} = X(X^T X + λI)^{-1} X^T y.

Substituting the SVD of X, the inner matrices become diagonal, and we can express the above using the outer–product rule for matrix multiplication as

X β̂^{rr} = \sum_{i=1}^{n} u_i \left( \frac{σ_i^2}{σ_i^2 + λ} \right) u_i^T y.   (10.6)
Since λ > 0, the term in the round brackets above is always less than
1. For suitably–chosen λ, this term is close to one for the larger singular
values and small for the small singular values. By comparing (10.5) and
(10.6), we see that the ridge regression approach is similar to the PCA
approach, but ridge regression applies a soft instead of a hard thresholding
function to suppress the effect of the components that are associated with the
small singular values. It may be seen from (10.4) that the ordinary normal
equation approach on the other hand applies no thresholding procedure at
all.
We note that both the PCA and ridge regression methods involve a process
which forces particular singular values of X to become smaller. This same
phenomenon also holds for other forms of regularization which are not dis-
cussed here. This process of reduction of the eigenvalues is the origin of the
term “shrinkage”, which is a term often used in the machine learning and
statistical literature to describe the regularization procedure.
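The comparison between (10.4), (10.5) and (10.6) can be made concrete by tabulating the factor that each method applies to the component u_i u_i^T y: unity for the normal equations, a hard 0/1 cut at r for PCA, and σ_i^2/(σ_i^2 + λ) for ridge regression. The singular values, r and λ below are illustrative.

import numpy as np

sigma = np.array([10.0, 5.0, 2.0, 0.5, 0.05, 0.01])   # illustrative singular values of X
r, lam = 3, 1.0

normal_eq = np.ones_like(sigma)                  # (10.4): every component kept as-is
pca_hard = (np.arange(len(sigma)) < r) * 1.0     # (10.5): keep the first r, drop the rest
ridge_soft = sigma**2 / (sigma**2 + lam)         # (10.6): soft shrinkage factors

for s, a, b, c in zip(sigma, normal_eq, pca_hard, ridge_soft):
    print(f"sigma={s:6.2f}  normal={a:.2f}  pca={b:.2f}  ridge={c:.3f}")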
i.e., an identity matrix with the first upper diagonal replaced with −1. Then the elements of z = Bx measure differences in successive elements of a vector x. If the solution is to be smooth, then we want ||z||_2^2 to be small. We can therefore modify the LS objective function to incorporate a smoothness constraint, leading to normal equations of the form

\left( X^T X + λ B^T B \right) β = X^T y.
The solution β̂^s to this form of the normal equations penalizes a non–smooth solution. As in (10.1), the above may also be expressed in the form

β̂^s = \arg\min_{β} ||Xβ − y||_2^2 \quad \text{subject to} \quad ||Bβ||_2 ≤ t.
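A numerical sketch of the smoothness-regularized solution: B is constructed as the identity with the first upper diagonal set to −1, and the modified normal equations are solved directly. The dimensions, noise level and λ are illustrative.

import numpy as np

rng = np.random.default_rng(7)
m, n, lam = 80, 20, 5.0
X = rng.standard_normal((m, n))
beta_true = np.sin(np.linspace(0, np.pi, n))          # a smooth coefficient vector
y = X @ beta_true + 0.5 * rng.standard_normal(m)

# B: identity with the first upper diagonal replaced by -1, so (B beta)_i = beta_i - beta_{i+1}.
B = np.eye(n) - np.diag(np.ones(n - 1), k=1)

beta_ls = np.linalg.solve(X.T @ X, X.T @ y)                     # ordinary normal equations
beta_s = np.linalg.solve(X.T @ X + lam * (B.T @ B), X.T @ y)    # smoothness-regularized solution
print(np.linalg.norm(B @ beta_ls), np.linalg.norm(B @ beta_s))  # the second norm is smaller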
10.3 Sparsity regularization
Figure 10.1. Curves of ||x||p vs. x for various values of p for the one–dimensional case.
Figure 10.2 shows the elliptical contours of the joint confidence regions of
the estimates β lasso for various values of α, as discussed in Sect. 7.4. These
ellipses are the contours for which (β lasso − β o )T X T X(β lasso − β o ) = k,
where the curves for various values of k are shown. The solution to (10.8)
corresponds to the case where the ellipse with the lowest possible k just
touches the constraint function ||β||1 ≤ t, which is the diamond region in
the top figure of Fig. 10.2. As can be seen, the ellipse touches the constraint
function on the β1 axis, where β2 = 0, thus inducing a sparse solution in the
present two dimensional system. When this situation is extended to multi-
ple dimensions, the “pointy” nature of the 1-norm constraint encourages a
solution along one of the co–ordinate axes, where most of the elements of β
are zero, again promoting sparsity in the solution.
On the other hand, we see from the lower figure in Fig. 10.2 that the
ellipse touches the circular constraint function ||β||2 ≤ t at a point away
from a co–ordinate axis, thus admitting a small value of β2 to exist in the
solution. From this example we see that a 2–norm constraint is ineffective
at encouraging a sparse solution.
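The sparsity-inducing behaviour of the 1-norm penalty can be demonstrated by solving min_β ||Xβ − y||_2^2 + α||β||_1 with a simple iterative soft-thresholding scheme. The sketch below is a minimal illustration (the step size, α and data are illustrative), not a production solver; a ridge solution is computed alongside for comparison.

import numpy as np

def lasso_ista(X, y, alpha, n_iter=500):
    """Iterative soft-thresholding for min ||X b - y||_2^2 + alpha ||b||_1 (a minimal sketch)."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2        # 1 / largest eigenvalue of X^T X
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        g = b - step * X.T @ (X @ b - y)          # gradient step on the quadratic term
        b = np.sign(g) * np.maximum(np.abs(g) - 0.5 * step * alpha, 0.0)   # soft threshold
    return b

rng = np.random.default_rng(8)
X = rng.standard_normal((100, 10))
beta_true = np.zeros(10)
beta_true[[1, 6]] = [3.0, -2.0]                   # only two nonzero coefficients
y = X @ beta_true + 0.1 * rng.standard_normal(100)

beta_lasso = lasso_ista(X, y, alpha=20.0)
beta_ridge = np.linalg.solve(X.T @ X + 20.0 * np.eye(10), X.T @ y)
print(np.round(beta_lasso, 3))    # most entries are exactly zero
print(np.round(beta_ridge, 3))    # entries are shrunk but generally remain nonzero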
Figure 10.2. Illustration of the effect of a 1-norm penalty in the two dimensional case. The
interior of the diamond region in the top figure is the set of points for which ||β||1 ≤ t,
whereas the circular region in the lower figure corresponds to ||β||2 ≤ t. The ellipses are
the contours of the LS error function.
Figure 10.3. An ERP waveform consisting of a superposition of Gaussian pulses for a lasso
simulation.
[Plot of the superimposed simulated ERP responses, in microvolts vs. time (samples).]
Figure 10.5. A dictionary of Gaussian pulses, each with its own unique delay.
in D. The vector a ∈ Rn is the vector of amplitudes associated with each
pulse. The vector n is the additive noise. A naive approach for constructing
the model Da is then to solve the following optimization problem:
Figure 10.6. A lasso reconstruction of an ERP signal. Shown are the original noise–free
ERP signal, its noisy version, and the lasso reconstruction.
10.4 Problems
Chapter 11
Toeplitz Systems
In this chapter we derive two different O(n²) algorithms for solving Toeplitz
systems. We start from the idea of forward and backward linear prediction
of an autoregressive process, which leads to a Toeplitz system of equations if
the process is stationary. These equations are then solved using a recursive
technique, where the dimension of the system is increased until the desired
size is obtained.
Our ostensible objective in this section is to exploit the structure of a Toeplitz system to develop a fast O(n²) technique for computing its solution. This compares with the O(n³) complexity incurred when Gaussian elimination
or Cholesky methods are used to solve the same system. However, it turns
out this fast algorithm is not the only dividend we receive in pursuing these studies. In developing the Toeplitz solution, we are also led to new insights
and very useful and interesting interpretations of AR systems, such as lattice
filter structures, and other special techniques for signal processing. These
structures lead to very powerful methods for adaptive filtering applications.
There are also moving average (MA) processes. These are the output of an
all-zero filter in response to a white-noise input. There are also autoregressive-
moving average (ARMA) processes which are the output of a filter with both
poles and zeros, in response to a white noise input. MA and ARMA processes are not directly considered in this book. The interested reader is
referred to [29] for a deeper examination of these subjects.
Consider an all-pole filter driven by a white noise process w(n) with output
x(n). We define the denominator polynomial H(z) of the filter transfer
function as

H(z) = 1 − \sum_{k=1}^{K} h_k z^{-k}.   (11.1)
Thus, the filter transfer function is 1/H(z). Taking z-transforms of the input–output relationship we have

X(z) = \frac{1}{H(z)} W(z), \quad \text{or} \quad X(z)H(z) = W(z).   (11.2)
Converting the above relationship back into the time domain, and realizing
that multiplication in the z-domain is convolution in time, we have using
(11.1)

x(n) − \sum_{k=1}^{K} h(k)\, x(n − k) = w(n)   (11.3)
or

x(n) = \sum_{k=1}^{K} h(k)\, x(n − k) + w(n).   (11.4)
This equation offers a very useful interpretation in that the present output
x(n) is predictable from a linear combination of past outputs within an
error w(n). This property is derived directly as a consequence of the all-
pole characteristic of the filter.
If w(n) is small in comparison to the first sum term on the right of (11.4) most of the time, then the predicted value x̂(n) of x(n) is given from (11.4) as

x̂(n) = \sum_{k=1}^{K} h(k)\, x(n − k).   (11.5)
Figure 11.1. The prediction–error filter configuration.
Using this form we can define a prediction error filter (PEF) by re–arranging
(11.4) as
² With the PEF, the input is an autoregressive process x(n) and the output is a white noise process w(n). The transfer function of the PEF is W(z)/X(z) = 1 − P(z). On the other hand, the input to the AR generating filter is w(n) and the output is x(n), as shown in Fig. 11.2. The AR transfer function can be written in the form 1/(1 − P(z)). Therefore the AR generating filter and the PEF are inverses of each other.
Figure 11.2. The AR generating filter configuration, which is the inverse of the PEF con-
figuration.
x(n) = \sum_{k=1}^{K} h_k\, x(n − k) + w(n), \qquad n = K, . . . , N, \quad N ≫ K.   (11.7)
x_p = X h + w   (11.8)

where

x_p = \begin{bmatrix} x_{K+1} \\ x_{K+2} \\ \vdots \\ x_N \end{bmatrix}, \qquad w = \begin{bmatrix} w_{K+1} \\ w_{K+2} \\ \vdots \\ w_N \end{bmatrix},

X = \begin{bmatrix} x_K & \cdots & x_1 \\ x_{K+1} & \cdots & x_2 \\ \vdots & & \vdots \\ x_{N−1} & \cdots & x_{N−K} \end{bmatrix}, \qquad h = \begin{bmatrix} h_1 \\ \vdots \\ h_K \end{bmatrix}.
We see that (11.8) is a regression equation, where the variables x are re-
gressed onto themselves. This is the origin of the term “autoregressive”.
As discussed, we choose the coefficients h to minimize the prediction error
power. The coefficients hLS found in such a manner minimize ||xp − Xh||22
and are therefore given as the solution to the normal equations:
X T XhLS = X T xp . (11.9)
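A numerical sketch of this estimation procedure: an AR(2) process is simulated, the vector x_p and matrix X of (11.8) are assembled from the samples, and (11.9) is solved in the least-squares sense. The coefficient values and data length are illustrative.

import numpy as np

rng = np.random.default_rng(9)
N, K = 2000, 2

# Simulate x(n) = 1.2 x(n-1) - 0.5 x(n-2) + w(n), a stable AR(2) process.
x = np.zeros(N)
w = rng.standard_normal(N)
for n in range(K, N):
    x[n] = 1.2 * x[n - 1] - 0.5 * x[n - 2] + w[n]

# Assemble x_p and X as in (11.8): each row of X holds [x(n-1), ..., x(n-K)].
x_p = x[K:]
X = np.column_stack([x[K - j - 1:N - j - 1] for j in range(K)])

# Solve the normal equations (11.9) in the least-squares sense.
h_ls = np.linalg.lstsq(X, x_p, rcond=None)[0]
print(h_ls)                                  # close to [1.2, -0.5]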
If the sequence x(n) is stationary and ergodic, the matrix E(X^T X) becomes

E(X^T X) = \begin{bmatrix} r_0 & r_{−1} & r_{−2} & \cdots & r_{−K+1} \\ r_1 & r_0 & r_{−1} & & \\ r_2 & r_1 & r_0 & \ddots & \\ \vdots & & \ddots & \ddots & \\ r_{K−1} & \cdots & & r_1 & r_0 \end{bmatrix} = R,   (11.11)

where r_i ≜ E(x_{n+i} x_n) is the autocorrelation function of x at lag i, and R is the covariance matrix of x(n).
RhLS = r p . (11.13)
Eq. (11.13) is the expectation of the normal equations used to determine the
coefficients of a stationary AR process. The finite–sample version of (11.13)
is referred to as the Yule–Walker equations. It is apparent from (11.11) that (11.13) is a symmetric Toeplitz system of equations. We describe an efficient O(n²) method of computing the solution to (11.13). But in the
process of developing this solution, we also uncover a great deal about the
underlying structure of AR processes.
Since (11.13) involves expectations, it only holds for the ideal case when
an infinite amount of data is available to form the covariance matrix R of
{x}. In the practical case where we have finite N , the normal equations
corresponding to (11.13) are not exactly Toeplitz. However, in the following
treatment, we still treat the finite case as if it were exactly Toeplitz. While
this form of treatment does not necessarily minimize the prediction error for the finite case, it imposes an asymptotic structure on the finite-N solution
which tends to produce more stable results.
E(X^T x_p) = r_p = R h_{LS}.
We can combine (11.13) and (11.14) together into one matrix equation as follows:

\begin{bmatrix} r_0 & r_{−1} & \cdots & r_{−K} \\ r_1 & r_0 & \ddots & r_{−K+1} \\ \vdots & & \ddots & \vdots \\ r_K & \cdots & r_1 & r_0 \end{bmatrix} \begin{bmatrix} 1 \\ −h_1 \\ \vdots \\ −h_K \end{bmatrix} = \begin{bmatrix} σ^2 \\ 0 \\ \vdots \\ 0 \end{bmatrix}   (11.15)

where the first row of (11.15) is given by (11.14), and the remaining rows follow from (11.13) in the form r_p − R h_{LS} = 0.
These are called the forward prediction-error equations. They are developed
directly from (11.4) where we have predicted x(n) in a forward direction
from a linear combination of past values.
The Backward Prediction Equations
The resulting backward prediction error equations are given as

\begin{bmatrix} r_0 & \cdots & r_{−K} \\ \vdots & \ddots & \vdots \\ r_K & \cdots & r_0 \end{bmatrix} \begin{bmatrix} −h_K \\ \vdots \\ −h_1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ σ^2 \end{bmatrix}.   (11.18)
Equations (11.15) and (11.18) may be solved jointly using the Levinson–Durbin recursion, which requires only O(n²) flops, and is explained as follows.
where σ^2_{(m−1)} is the prediction error power at stage m − 1. The backward equations are written from (11.18) as

\begin{bmatrix} r_0 & \cdots & r_{−m+1} \\ \vdots & \ddots & \vdots \\ r_{m−1} & \cdots & r_0 \end{bmatrix} \begin{bmatrix} −h^{(m−1)}_{m−1} \\ \vdots \\ −h^{(m−1)}_{1} \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ σ^2_{(m−1)} \end{bmatrix}.   (11.20)
= \begin{bmatrix} σ^2_{(m−1)} \\ 0 \\ \vdots \\ 0 \\ Δ^{(m−1)} \end{bmatrix} + ρ_m \begin{bmatrix} Δ^{(m−1)} \\ 0 \\ \vdots \\ 0 \\ σ^2_{(m−1)} \end{bmatrix}
Looking only at the forward portion of (11.23), because of the zero at the
bottom of the vector of unknowns on the left of the equals sign, the first
m−1 equations of (11.23) are identical to those of (11.19). Only the last row
of the mth-order system is different; because this equation does not occur in
the (m − 1)th–order system, we denote the right-hand side of this equation
as the special quantity ∆(m−1) , which is defined from (11.23) as
Δ^{(m−1)} = −\sum_{i=1}^{m} r_i\, h^{(m−1)}(m − i).   (11.24)
In the above, h(0) ≜ 1.
analogous to the forward part, except everything is reversed top-to-bottom.
h_i^{(m)} = h_i^{(m−1)} + ρ_m\, h_{m−i}^{(m−1)}, \qquad i = 1, . . . , m − 1   (11.25)

h_m^{(m)} = −ρ_m.   (11.26)
Notice that (11.25) gives the mth–order coefficients in terms of the (m−1)th–
order coefficients and the quantity ρm . Thus, once ρm is determined, we can
complete the iteration from the (m − 1)th to the mth stage. To determine
ρm , we compare the right-hand sides of (11.22) and (11.23), to obtain
ρ_m = −\frac{Δ^{(m−1)}}{σ^2_{(m−1)}}.   (11.30)
Equations (11.30), (11.29), (11.25) and (11.26) define the recursion from the (m − 1)th to the mth step. The induction process is complete by noting that Δ^{(0)} = r_1, h_0 = 1, and σ^2_{(0)} = r_0.
We now summarize the LDR. Starting with the above initial conditions and m = 1, we apply (11.24), (11.30), (11.25), (11.26) and (11.29) in turn at each stage, incrementing m until the desired order K is reached.
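A minimal sketch of the recursion in code is given below. The input is the autocorrelation sequence r_0, . . . , r_K; the error-power update σ^2_{(m)} = σ^2_{(m−1)}(1 − ρ_m^2) is the standard form assumed for (11.29), and the sign conventions follow the usual Levinson–Durbin formulation, which may differ from the ρ_m of the text by a sign.

import numpy as np

def levinson_durbin(r):
    """Solve the Yule-Walker equations R h = r_p for the AR coefficients h, given the
    autocorrelation sequence r = [r_0, r_1, ..., r_K] (standard sign conventions)."""
    K = len(r) - 1
    h = np.zeros(K)                              # holds h^{(m)} as the order m grows
    sigma2 = r[0]                                # prediction error power sigma^2_{(0)} = r_0
    for m in range(1, K + 1):
        h_prev = h[:m - 1].copy()
        delta = r[m] - h_prev @ r[m - 1:0:-1]    # the "extra" term in the order-m system
        rho = delta / sigma2                     # reflection (partial correlation) coefficient
        h[:m - 1] = h_prev - rho * h_prev[::-1]  # order update of the first m-1 coefficients
        h[m - 1] = rho
        sigma2 *= (1.0 - rho ** 2)               # error-power update
    return h, sigma2

# Example with an illustrative autocorrelation sequence; compare with a direct solve.
r = np.array([2.0, 1.2, 0.4])
h, s2 = levinson_durbin(r)
print(h, s2)
print(np.linalg.solve(np.array([[r[0], r[1]], [r[1], r[0]]]), r[1:]))   # same h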
11.1.3 Further Analysis on Toeplitz Systems
There are many significant repercussions which result from this previous
analysis. In the following, we present several aspects of Toeplitz system
analysis as it relates to the field of signal processing.
x(n) = \sum_{k=1}^{K} h(k)\, x(n − k) + w(n).
The quantity K is the true order of the system, whereas m is an index which
iterates as m = 1, . . . , K, according to the LDR. Thus, at the mth stage of
the LDR, equation (11.4) is effectively replaced by
x(n) = \sum_{k=1}^{m} h(k)\, x(n − k) + w(n).   (11.31)
The quantity σ^2_{(m)} is the power of the noise term w(n) at the mth stage; i.e., σ^2_{(m)} = E(w^2(n)). Thus, with the initial value m = 1, only one past value of x is used to predict the present value, when in fact K past values are required to predict as accurately as possible. Thus, for m = 1, the prediction process indicated by (11.31) is not very accurate and the resulting noise power σ^2_{(1)}
The Partial Correlation Coefficients ρm
The quantities ρm are significant. They are referred to as the partial corre-
lation coefficients, or by analogy of (11.29) to power reflected from a load
on a transmission line, they are sometimes referred to as the reflection coef-
ficients. From (11.29), ρm indicates the reduction in prediction error power
in going from the (m − 1)th to the mth stage. In accordance with previous
discussion, ρm = 0 for m > K.
det R = r_0 \prod_{m=1}^{K} \left(1 − |ρ_m|^2\right) = r_0 \prod_{m=1}^{K} σ^2_{(m)}   (11.32)
From (11.32), we see that if |ρm | > 1 for any m, then det R could be less than
zero. However, we know that covariance matrices are positive semi-definite
and must have determinants greater than or equal to zero. Furthermore, it
may be shown that if |ρm | > 1, then some of the poles of the all-pole filter
which generates the observed AR process are outside the unit circle. This of
course will lead to instability and a non-stationary process whose covariance
matrix does not exist. Therefore we must have |ρm | ≤ 1 for m = 1, . . . , K.
However, with a finite sample of data, it is not guaranteed that the LDR
will always yield values ρm such that |ρm | ≤ 1.
11.1.4 The Burg Recursion
Here we look at the forward and backward prediction errors in further detail.
The forward prediction errors (i.e., the output of the forward PEF) at the
mth stage, referred to as wf,m (n) can be inferred through (11.7) as:
w_{f,m}(n) = x(n) − \sum_{k=1}^{m} h^{(m)}(k)\, x(n − k) = \sum_{k=0}^{m} a^{(m)}(k)\, x(n − k), \qquad m = 1, . . . , K   (11.33)
where

a(k) = \begin{cases} 1, & k = 0 \\ −h(k), & k = 1, . . . , m. \end{cases}   (11.34)
Likewise, the backward prediction errors at the mth stage, which are the
outputs of the backward PEF, denoted wb,m (n), can be inferred through
(11.17)
w_{b,m}(n) = \sum_{k=0}^{m} a^{(m)}_{m−k}\, x(n − k).   (11.35)
For ease of notation, we define the forward prediction error power at the mth stage as P_{f,m} (formerly σ^2_{(m)}) and the backward prediction error power as P_{b,m}.
To accomplish this, we must express Pf,m and Pb,m in terms of ρm . This may
be done by first developing new expressions for the forward and backward
prediction errors wf,m (n) and wb,m (n) in terms of ρm .
Substituting (11.34) into (11.25) and (11.26), we have

a_i^{(m)} = \begin{cases} a_i^{(m−1)} + ρ_m\, a_{m−i}^{(m−1)}, & i = 1, . . . , m − 1 \\ 0, & i > m \end{cases}   (11.37)

a_m^{(m)} = ρ_m   (11.38)

a_0^{(m)} = 1.   (11.39)
Also, by substituting k for m − k in the second summation term, we get the
forward prediction error wf,m−1 (n). Therefore (11.43) may be written as
wb,m (n) = wb,m−1 (n − 1) + ρm wf,m−1 (n). (11.44)
Equations (11.41) and (11.44) are the desired expressions for wf,m (n) and
w_{b,m}(n) in terms of the coefficient ρ_m. Fig. 11.3 shows how the forward and backward prediction errors at order m are formed from those at order (m − 1). We now return to the discussion on choosing ρ_m to minimize (11.36).
By definition,

P_{f,m} = \frac{1}{N − m} \sum_{n=m+1}^{N} \left( w_{f,m}(n) \right)^2   (11.45)

and

P_{b,m} = \frac{1}{N − m} \sum_{n=1}^{N−m+1} \left( w_{b,m}(n) \right)^2   (11.46)
where N is the length of the original observation x(n). Differentiating
(11.45) with respect to ρm , we get
\frac{∂P_{f,m}}{∂ρ_m} = \frac{2}{N − m} \sum_{n} w_{f,m}(n) \cdot \frac{∂w_{f,m}(n)}{∂ρ_m}   (11.47)

\qquad\quad = \frac{2}{N − m} \sum_{n} w_{f,m}(n) \cdot w_{b,m−1}(n − 1)   (11.48)
where (11.41) was used to evaluate the derivative in (11.47). Substituting
(11.41) into (11.48) to express all quantities at the (m − 1)th order, we have
\frac{∂P_{f,m}}{∂ρ_m} = \frac{2}{N − m} \sum_{n} \left( w_{f,m−1}(n) + ρ_m w_{b,m−1}(n − 1) \right) w_{b,m−1}(n − 1)

\qquad\quad = \frac{2}{N − m} \sum_{n} w_{f,m−1}(n) \cdot w_{b,m−1}(n − 1) + ρ_m \frac{2}{N − m} \sum_{n} \left( w_{b,m−1}(n − 1) \right)^2.   (11.49)
In a similar way, we can determine ∂P_{b,m}/∂ρ_m as

\frac{∂P_{b,m}}{∂ρ_m} = \frac{2}{N − m} \sum_{n} w_{f,m−1}(n) \cdot w_{b,m−1}(n − 1) + ρ_m \frac{2}{N − m} \sum_{n} \left( w_{f,m−1}(n) \right)^2.   (11.50)
Substituting (11.49) and (11.50) into (11.36), and setting the result to zero,
we have the final desired expression for ρm :
ρ_m = \frac{ −2 \sum_{n=m−1}^{N} w_{f,m−1}(n)\, w_{b,m−1}(n − 1) }{ \sum_{n=m−1}^{N} \left( w_{f,m−1}(n) \right)^2 + \sum_{n=m−1}^{N} \left( w_{b,m−1}(n − 1) \right)^2 }.   (11.51)
Notice that this expression for ρm at the mth stage is a function only of the
prediction errors at stage m − 1. Hence, the quantity ρm may be calculated
using only the signals available at stage m − 1. The mth-order coefficients
may be immediately determined from (11.37)-(11.39) once ρm is known.
Initialize:
    m = 0
    a_0^{(m)} = 1
    {w_{f,m}} = {x}
    {w_{b,m}} = {x}
    P_{f,m} = r_0

Iterate for m = 1, . . . , K:
    determine ρ_m from (11.51);
    calculate w_{f,m}(n) and w_{b,m}(n) from (11.41) and (11.44) respectively;
    calculate the mth-order coefficients from (11.37)–(11.39);
    if desired, calculate P_{f,m} from (11.29);
end.
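A compact sketch of this procedure in code, implementing (11.51) for ρ_m, the lattice updates (11.41) and (11.44) for the prediction errors, and (11.37)–(11.39) for the coefficients. The handling of the end points of the error sequences is an implementation choice and may differ slightly from the summation limits used in the text; the simulated data are illustrative.

import numpy as np

def burg(x, K):
    """Burg estimation of the PEF coefficients a_0, ..., a_K (a_0 = 1) from the data x."""
    a = np.array([1.0])
    wf = np.asarray(x, dtype=float).copy()       # forward prediction errors w_{f,m}(n)
    wb = wf.copy()                               # backward prediction errors w_{b,m}(n)
    for m in range(1, K + 1):
        f, b = wf[1:], wb[:-1]                   # w_{f,m-1}(n) paired with w_{b,m-1}(n-1)
        rho = -2.0 * (f @ b) / (f @ f + b @ b)   # (11.51)
        wf = f + rho * b                         # (11.41)
        wb = b + rho * f                         # (11.44)
        a = np.concatenate([a, [0.0]]) + rho * np.concatenate([[0.0], a[::-1]])  # (11.37)-(11.39)
    return a                                     # h_k = -a_k for k >= 1, as in (11.34)

# Example: data from the AR(2) process x(n) = 1.2 x(n-1) - 0.5 x(n-2) + w(n).
rng = np.random.default_rng(10)
N = 5000
x = np.zeros(N)
w = rng.standard_normal(N)
for n in range(2, N):
    x[n] = 1.2 * x[n - 1] - 0.5 * x[n - 2] + w[n]

print(-burg(x, 2)[1:])                           # close to [1.2, -0.5]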
As mentioned previously, this new procedure has the advantage that |ρm | ≤
1 for any reasonable sample size of data. To prove this point, consider the
matrix

W = \begin{bmatrix} w_{f,m} & w_{b,m} \\ w_{b,m} & w_{f,m} \end{bmatrix}   (11.52)
where wf,m = [wf,m (m + 1), . . . , wf,m (N )]T , and wb,m = [wb,m (m), . . . , wb,m (N − 1)]T .
Equations (11.41) and (11.44) lead to a very interesting and useful interpre-
tation of the earlier prediction error filter structure.
Figure 11.3. Representation of a single stage of a lattice filter, corresponding to eqs. (11.41)
and (11.44)
Figure 11.4. The prediction error filter implemented as a cascaded lattice filter. Each
section is implemented as shown in Fig. 11.3.
for a given number of bits, the ρ_m's can be represented with higher precision than the a_k^{(K)} coefficients; i.e., they are less sensitive to finite precision effects, and the lattice filter of Fig. 11.4 is generally the preferred structure.
where U is an undetermined upper-triangular matrix whose value will soon
become apparent. We denote the second matrix on the left of (11.56) as A,
and the matrix product on the left as C. We premultiply each side of (11.56)
by AT . The right-hand side of this product is AT C. The matrices AT and
C are both upper triangular. This follows because AT is upper triangular
from its definition; that C is upper triangular follows from the right-hand
side of (11.56). Since A^T has ones on its main diagonal, the diagonal entries of the product A^T C are the same as those of C. We therefore have

A^T C = A^T R A = \begin{bmatrix} P_K & & & 0 \\ & P_{K−1} & & \\ & & \ddots & \\ 0 & & & P_0 \end{bmatrix} ≜ P.   (11.57)
The first reason why (11.57) is significant is as follows. We define the upper-
triangular matrix B as
B = AS (11.58)
where S = P^{−1/2}. Then, from (11.57), we have

R = A^{−T} P A^{−1} = B^{−T} B^{−1}   (11.59)
where N ≫ K. It is clear that R = X^T X. We can perform a QR decom-
position of X as
X = QU (11.61)
where U is the upper-triangular factor and Q has orthonormal columns.
Then,
R = UT U. (11.62)
AT RA = AT X T XA = P = diag. (11.63)
This idea may be extended even further. By comparing the operation involved in the matrix product XA with (11.35), we see that the columns of XA are the backward prediction errors w_{b,m}(n), m = K, . . . , 1. We can
thus write
XB = [w̃b,K , w̃b,K−1 , . . . , w̃b,0 ]
This is equivalent to saying the matrix XA has orthogonal columns. This
orthogonality has important consequences in the field of adaptive filtering,
but is not considered further here.
We now have enough background to easily prove (11.32). From (11.57) we see that R = A^{−T} P A^{−1}. The matrices A^{−1} and A^{−T} are upper triangular with ones on the main diagonal, so their determinants are unity. Because the determinant of a product is the product of determinants, we therefore see that det R = \prod_i P_i. The second half of (11.32) follows from (11.29).
R−1 = AP −1 AT . (11.64)
Bibliography
[1] G. Golub and C. Van Loan, Matrix Computations, 3rd ed. The Johns Hopkins University Press, 1996.
[2] S. Marple Jr., Digital Spectral Analysis. Prentice-Hall, 1987.
[3] S. Haykin, Nonlinear methods of spectral analysis. Springer Science &
Business Media, 2006, vol. 34.
[4] A. J. Laub, Matrix analysis for scientists and engineers. Siam, 2005,
vol. 91.
[5] N. K. Sinha and G. J. Lastman, Microcomputer-based numerical methods for science and engineering, 1988.
[6] K. Petersen, M. Pedersen et al., “The matrix cookbook, vol. 7,” Tech-
nical University of Denmark, vol. 15, 2008.
[7] A. Papoulis, Random variables and stochastic processes. McGraw Hill,
1994.
[8] S. Haykin, Adaptive Filter Theory, 4th ed. Prentice Hall, Englewood Cliffs, NJ, USA, 2001.
[9] R. Schmidt, “Multiple emitter location and signal parameter estima-
tion,” IEEE transactions on antennas and propagation, vol. 34, no. 3,
pp. 276–280, 1986.
[10] S. Haykin, J. Reilly, V. Kezys, and E. Vertatschitsch, “Some aspects
of array signal processing,” in IEE Proceedings F (Radar and Signal
Processing), vol. 139, no. 1. IET, 1992, pp. 1–26.
[11] T. M. Cover and J. Thomas, Elements of information theory. John
Wiley & Sons, 1999.
[12] H. L. Van Trees, Detection, estimation, and modulation theory, part I:
detection, estimation, and linear modulation theory. John Wiley &
Sons, 2004.
[13] L. L. Scharf and C. Demeure, Statistical signal processing: detection,
estimation, and time series analysis. Prentice Hall, 1991.
[14] J. H. Wilkinson, The algebraic eigenvalue problem. Clarendon press
Oxford, 1965, vol. 87.
[15] G. Strang, Linear Algebra and its Applications. Harcourt Brace Jo-
vanovich College Publishers, 1988.
[16] S. Haykin, Communication systems. John Wiley & Sons, 2008.
[17] Z. J. Koles, “The quantitative extraction and topographic mapping of
the abnormal components in the clinical eeg,” Electroencephalography
and clinical Neurophysiology, vol. 79, no. 6, pp. 440–447, 1991.
[18] P. L. Nunez, R. Srinivasan et al., Electric fields of the brain: the neu-
rophysics of EEG. Oxford University Press, USA, 2006.
[19] I. T. Jolliffe, “Principal components in regression analysis,” in Principal
component analysis. Springer, 1986, pp. 129–155.
[20] R. Rosipal and N. Krämer, “Overview and recent advances in partial least squares,” in International Statistical and Optimization Perspectives Workshop “Subspace, Latent Structure and Feature Selection”. Springer, 2005, pp. 34–51.
[21] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,”
Chemometrics and intelligent laboratory systems, vol. 2, no. 1-3, pp.
37–52, 1987.
[22] H. Abdi, “Partial least square regression (pls regression),” Encyclopedia
for research methods for the social sciences, vol. 6, no. 4, pp. 792–795,
2003.
[23] P. Geladi and B. R. Kowalski, “Partial least-squares regression: a tu-
torial,” Analytica chimica acta, vol. 185, pp. 1–17, 1986.
[24] S. Wold, A. Ruhe, H. Wold, and W. Dunn III, “The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses,” SIAM Journal on Scientific and Statistical Computing, vol. 5, no. 3, pp. 735–743, 1984.
[25] T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical
learning: data mining, inference, and prediction. Springer Science &
Business Media, 2009.
[28] P. C. Hansen, “The l-curve and its use in the numerical treatment of
inverse problems,” 1999.