Introduction

Consider a p-dimensional Gaussian random vector X with density

    p(x) = 1/((2π)^{p/2} |R|^{1/2}) exp{ −(1/2)(x − μ)^t R^{−1} (x − μ) }          (1)

where

    μ = E[X]                                                                        (2)
    R = E[(X − μ)(X − μ)^t] .                                                       (3)

Since the covariance R is symmetric and positive semidefinite, it has the eigen-decomposition

    R = E Λ E^t                                                                     (4)

where E = [e_1, . . . , e_p] is an orthonormal matrix of eigenvectors and Λ = diag(λ_1, . . . , λ_p) is the diagonal matrix of the corresponding eigenvalues. Because E is orthonormal, the inverse covariance is

    R^{−1} = E Λ^{−1} E^t                                                           (5)

so the quadratic form in the exponent of (1) can be written as

    (x − μ)^t R^{−1} (x − μ) = Σ_{k=1}^{p} (x̃_k − μ̃_k)^2 / λ_k                      (6)

where x̃ = E^t x and μ̃ = E^t μ.

Questions or comments concerning this laboratory should be directed to Prof. Charles A. Bouman, School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907; (765) 494-0340; bouman@ecn.purdue.edu.
The form of the argument in (6) indicates that the contours of the density p(x) are ellipsoidal, as illustrated in Figure 1(a). Since the transformation X̃ = E^t X is simply a rigid rotation of the axes (because E is orthonormal), the relationship in (6) confirms two things: first, that the principal directions of the ellipsoidal contours are given by the eigenvectors in E, and second, that the lengths of the principal axes are proportional to the square roots of the eigenvalues, λ_k. Notice in Figure 1(b) that the contours in the rotated {e_1, e_2} coordinate system do not have any diagonal component. This reflects the fact that the random variables in X̃ are uncorrelated.

Further, since the random variables in X̃ are uncorrelated, we can produce a whitened random vector W ~ N(0, I), with i.i.d. components, by simply normalizing the variance of each element of X̃,

    W = Λ^{−1/2} E^t X .                                                            (12)
Figure 1: Contours illustrating the shape of a Gaussian density (p = 2). e_k and λ_k are the eigenvectors and eigenvalues of the covariance matrix of X = (X_1, X_2). (a) Original density, (b) density of the decorrelated random vector X̃, (c) density of the whitened random vector W.
2.1
Our goal will be to use Matlab to generate independent Gaussian random vectors, X_i, having the covariance

    R_X = [ 2    1.2
            1.2  1   ]                                                              (14)

by inverting the whitening transformation in (12): starting from a white random vector W, the vector X = E Λ^{1/2} W has covariance R_X. Consider the eigen-decomposition R_X = E Λ E^t.
1. First generate a set of n = 1000 samples of i.i.d. N(0, I) Gaussian random vectors, W_i ∈ ℝ^p, with p = 2 and covariance R_W = I_{2×2}. Place them as the columns of a p × n matrix W.

2. Next generate the scaled random vectors X̃_i = Λ^{1/2} W_i. (Eigenvalues and eigenvectors can be computed with Matlab's eig function.)

3. Finally, generate the samples X_i by applying the transformation X_i = E X̃_i.

4. Produce scatter plots of W, X̃, and X in separate figure windows. For each, use commands similar to plot(W(1,:),W(2,:),'.'), assuming W is oriented as p × n. Be sure to use axis('equal') after each plot command to force the same scale on the horizontal and vertical axes.
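A minimal Matlab sketch of these four steps (variable names are illustrative, not required):

    % Generate n samples of X with covariance R_X from (14).
    n = 1000;  p = 2;
    Rx = [2 1.2; 1.2 1];               % target covariance, equation (14)
    W = randn(p, n);                   % step 1: i.i.d. N(0,1) entries, columns are W_i
    [E, Lambda] = eig(Rx);             % eigen-decomposition R_X = E*Lambda*E'
    Xtilde = sqrt(Lambda) * W;         % step 2: Xtilde_i = Lambda^(1/2) * W_i
    X = E * Xtilde;                    % step 3: X_i = E * Xtilde_i
    % Step 4: scatter plots, one per figure window.
    figure; plot(W(1,:), W(2,:), '.');           axis('equal');
    figure; plot(Xtilde(1,:), Xtilde(2,:), '.'); axis('equal');
    figure; plot(X(1,:), X(2,:), '.');           axis('equal');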
This exercise will be continued in the following section.
Section 2.1 Report:
Hand in your scatter plots for W, X̃, and X.
2.2

Obviously, before we can decorrelate or whiten a data set, we first need to know something about the covariance. We often do not know the true covariance, but we can obtain an estimate from a set of training data.

Suppose we have a set of n i.i.d. training vectors, arranged as columns in a p × n data matrix X,

    X = [X_1  X_2  . . .  X_n] .                                                    (15)

(Note that we've slightly changed notation from the previous section, so that now the X_i are vectors, and X is a matrix.) If the training vectors are known to be zero mean (μ = [0 . . . 0]^t), then an unbiased estimate of the covariance is

    R̂ = (1/n) Σ_{i=1}^{n} X_i X_i^t = (1/n) X X^t .                                 (16)
In practice, it is often necessary to center the data by first estimating and removing the sample mean. For example, if the above X_i's are i.i.d. random vectors with unknown mean, μ, and covariance, R, then we can use the following to obtain an unbiased covariance estimate,

    μ̂ = (1/n) Σ_{i=1}^{n} X_i                                                       (17)

    R̂ = 1/(n − 1) Σ_{i=1}^{n} (X_i − μ̂)(X_i − μ̂)^t = 1/(n − 1) Z Z^t               (18)

where Z_i = X_i − μ̂ are the mean-centered data vectors and Z = [Z_1 · · · Z_n] is the associated matrix of centered vectors.
Now, having an estimate of the covariance, the whitening transformation of (12) can be obtained from the eigen-decomposition of R̂. Note that if R̂ is not full rank, some of the eigenvalues in Λ̂ will be zero. This issue will be discussed further in the next section.
1. Using the 1000 samples of X_i generated in the previous exercise, estimate the covariance using the expressions in (17) and (18). Produce a listing of the covariance estimate and compare it to the theoretical values.

2. From the covariance estimate, use Matlab to compute the transformation X̃_i = Ê^t (X_i − μ̂) that decorrelates the X_i samples. Apply this transformation to the data to produce the zero-mean, decorrelated samples X̃_i.

3. Use Matlab to compute the transformation that will fully whiten the X_i samples, as in (12). Apply this transformation to the data to produce the zero-mean, identity-covariance samples W_i.

4. Produce scatter plots of X̃_i and W_i, using the same guidelines as before. Also compute R̂_W, the covariance estimate of W.
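A rough Matlab sketch of these steps, assuming X is the p × n matrix of samples generated in Section 2.1 (variable names are illustrative):

    [p, n] = size(X);
    mu_hat = mean(X, 2);                         % sample mean, equation (17)
    Z = X - repmat(mu_hat, 1, n);                % mean-centered data
    R_hat = (Z * Z') / (n - 1);                  % covariance estimate, equation (18)
    [E_hat, Lambda_hat] = eig(R_hat);            % R_hat = E_hat*Lambda_hat*E_hat'
    Xtilde = E_hat' * Z;                         % zero-mean, decorrelated samples
    Wh = diag(1 ./ sqrt(diag(Lambda_hat))) * Xtilde;   % whitened samples, as in (12)
    R_hat_W = (Wh * Wh') / (n - 1);              % should be close to the identity
    figure; plot(Xtilde(1,:), Xtilde(2,:), '.'); axis('equal');
    figure; plot(Wh(1,:), Wh(2,:), '.');         axis('equal');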
Section 2.2 Report:
1. Hand in the theoretical value of the covariance matrix, R_X. (Hint: It is given in equation (14).)
2. Hand in a numerical listing of your covariance estimate R̂_X.
3. Hand in your scatter plots for X̃_i and W_i.
4. Hand in a numerical listing of the covariance estimate R̂_W.
As the previous exercise demonstrated, the eigenvectors and eigenvalues can be estimated using the eigen-decomposition of the sample covariance,

    R̂ = Ê Λ̂ Ê^t .                                                                   (19)

However, this is often not practical for high-dimensional data, especially if the data dimension, p, is much larger than the number of training images, n. For example, in working with images the data vectors can be quite large, with p being the number of pixels in the image. This can make R̂ extremely large. For example, the covariance of a 400 × 400 image would contain 400^4, or around 25 billion, elements! However, since the columns of R̂ in (16) are all linear combinations of the same n vectors, the rank of R̂ can be no greater than n; hence R̂ will have, at most, n nonzero eigenvalues. We can compute these n eigenvalues and the corresponding n eigenvectors without actually computing the covariance matrix. The answer lies in a highly useful matrix factorization, the singular value decomposition (SVD).

The SVD of a p × n matrix X with p > n has the following form,

    X = U Σ V^t                                                                     (20)

where the columns of U are the left singular vectors, the columns of V are the right singular vectors, and the elements along the diagonal of Σ are the singular values, which are conventionally arranged in descending order.
In an imaging application where X represents a data matrix (each column is a single image arranged in raster order, for example), it is often the case that p >> n (fewer training images than pixels), so the SVD has the following structure,

    X    =    U      Σ      V^t
  (p×n)     (p×n)  (n×n)  (n×n) .                                                  (21)

Forming the products X X^t and X^t X then gives

    X X^t = U Σ^2 U^t                                                               (22)
    X^t X = V Σ^2 V^t .                                                             (23)
Since Σ^2 is diagonal, (22) and (23) are each in the form of an eigen-expansion. So from the SVD of the data matrix X, we see in (22) that the left singular vectors in U are the n eigenvectors of X X^t corresponding to nonzero eigenvalues, and the singular values in Σ are the square roots of the corresponding eigenvalues.

Now since R̂ = (1/n) X X^t, the result in (22) allows the calculation of the nonzero eigenvalues and corresponding eigenvectors of R̂ without explicitly computing R̂ itself, which is especially efficient if n << p. The procedure is summarized as follows:
1. Let Z = (1/√n) X. Notice that R̂ = Z Z^t. (For non-zero-mean data, first subtract the sample mean, μ̂, from each column of X and divide by √(n − 1).)

2. Compute the SVD of Z = U Σ V^t.

3. From (22) we know the n columns of U are eigenvectors of R̂, and the diagonal elements of Σ are the square roots of the corresponding eigenvalues.
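A rough Matlab sketch of this procedure, assuming the p × n data matrix X has already been centered to zero mean (variable names are illustrative):

    [p, n] = size(X);
    Z = X / sqrt(n);               % so that R_hat = (1/n)*X*X' = Z*Z'
    [U, S, V] = svd(Z, 0);         % compact ("economy") SVD of Z
    eigvecs = U;                   % the n eigenvectors of R_hat (columns of U)
    eigvals = diag(S).^2;          % the corresponding nonzero eigenvalues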
The eigenvectors of an image covariance matrix are also called eigenimages. The eigenimages corresponding to the largest eigenvalues represent the directions in ℝ^p of the greatest variation among a set of images having that covariance. Therefore, the coordinates of an image along these eigenvector directions (obtained by projecting the image onto each eigenvector) provide a useful set of parameters, or a feature vector, characterizing the image. If we let U_m be a matrix containing the first m eigenvectors, U_m = [u_1 · · · u_m], the eigenvector feature vector, Y, for the image X is computed by

    Y = U_m^t X .                                                                   (24)
This can be viewed as a specific type of data reduction where a high-dimensional vector X is represented with a lower-dimensional vector Y. Note that Y is not an image; it doesn't even have the same dimension as X. However, we can obtain an approximation of the original image X from a linear combination of the eigenimages,

    X̂ = Σ_{k=1}^{m} u_k (u_k^t X) = U_m Y .                                         (25)
It can easily be shown that the mean square error of this approximation is the sum of the remaining eigenvalues,

    E[ ||X − X̂||^2 ] = Σ_{k=m+1}^{p} λ_k .                                          (26)
Therefore the synthesis approximation will be closest in the MSE sense if we use the largest
eigenvalue/eigenvector components. Use of the approximation in (25) is commonly referred
to as principal component analysis, or PCA.
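As a concrete illustration of (24)-(26), the following Matlab sketch projects a zero-mean data matrix onto the first m eigenvectors and synthesizes an approximation; the variables U (eigenvectors sorted by decreasing eigenvalue), lambda, X, and m are assumptions for this example:

    m = 10;
    Um = U(:, 1:m);                % first m eigenvectors
    Y  = Um' * X;                  % feature vectors, equation (24)
    Xhat = Um * Y;                 % rank-m synthesis, equation (25)
    % Average squared error; by (26) this should be close to sum(lambda(m+1:end)).
    mse = mean(sum((X - Xhat).^2, 1));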
4.1 Exercise

In this exercise we will compute eigenvectors associated with images (also called eigenimages) of typed English letters. Training images are provided in the file training_data.zip, which can be downloaded from the lab web page. This file will unzip to a directory named training_data, which contains subdirectories of character images typed in various fonts. It also contains a Matlab script read_data.m that will read all of these training images into the columns of a single matrix X.

Your first task is to compute the eigenvalues and eigenvectors of the estimated image covariance matrix, as determined by the given training images. However, as discussed in Section 3, you should do this without directly computing the image covariance R̂ = (1/n) X X^t. An outline of the procedure follows:
1. Use the provided read_data.m script to read the images into the columns of a matrix X.

2. Compute the mean image, μ̂, over the entire data set, and center the data by subtracting the mean image from each column of X.

3. Use the approach described in Section 3 to compute the eigenvalues and eigenvectors of the image covariance for this data set. Again, you should not compute the (p × p) covariance matrix directly. Note that the Matlab syntax [U S V]=svd(Z,0) computes the SVD matrices of Z in the compact form of (21).
Display the eigenimages associated with the 12 largest eigenvalues. You will have to reshape each image column vector into a 64 × 64 image matrix. Use the imagesc command (rather than image) to automatically scale the displayed gray level range, and use a grayscale colormap by issuing the command colormap(gray(256)). Use subplot(4,3,i) to place the 12 eigenimages into a single figure. You might want to use the read_data.m script for guidance.

Next, for each of the images in the centered data set, compute the projection coefficients Y = U^t (X − μ̂) along the n eigenvectors in U. Note that the projection coefficients Y for each image form an n × 1 column vector, so these can all be placed as columns in a single (n × n) matrix.

On the same axes, plot the first 10 projection coefficients for the first four images in the data set (e.g. X(:,1:4), which corresponds to the letters {a,b,c,d} in a particular font). To be clear, the figure should contain four graphs on the same axes, and the horizontal axis should range from 1 to 10. Use the legend command to identify each of the 4 graphs.
Finally, for the first image in the data set, X(:,1), show the result of synthesizing the original image using only the first m eigenvectors. Do this for m = 1, 5, 10, 15, 20, 30. Remember to add the mean μ̂ back in after the synthesis, and you will again have to reshape the image column vectors back into a matrix before displaying. Use subplot(3,2,i) and image to display the six synthesized versions, and also produce a plot of the original image.
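A rough Matlab sketch of the full exercise, assuming read_data.m produces the p × n matrix X with p = 64*64 (variable names are illustrative):

    read_data;                                     % assumed to create the matrix X
    [p, n] = size(X);
    mu_hat = mean(X, 2);                           % mean image
    Z = (X - repmat(mu_hat, 1, n)) / sqrt(n - 1);  % centered, scaled data
    [U, S, V] = svd(Z, 0);                         % columns of U are the eigenimages
    lambda = diag(S).^2;                           % eigenvalues of the covariance estimate

    figure; colormap(gray(256));                   % first 12 eigenimages
    for i = 1:12
        subplot(4, 3, i); imagesc(reshape(U(:, i), 64, 64)); axis('image');
    end

    Y = U' * (X - repmat(mu_hat, 1, n));           % projection coefficients (n x n)
    figure; plot(Y(1:10, 1:4)); legend('a', 'b', 'c', 'd');

    mvals = [1 5 10 15 20 30];                     % synthesis of the first image
    figure; colormap(gray(256));
    for i = 1:length(mvals)
        m = mvals(i);
        Xsyn = U(:, 1:m) * Y(1:m, 1) + mu_hat;     % equation (25), plus the mean
        subplot(3, 2, i); image(reshape(Xsyn, 64, 64));
    end
    figure; image(reshape(X(:, 1), 64, 64)); colormap(gray(256));   % original image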
Section 4 Report:
1. Hand in the figure with the first 12 eigenimages.
2. Hand in the plots of projection coefficients vs. eigenvector number.
3. Hand in the original image, and the 6 resynthesized versions.
Image Classification
In a classification problem we are given an input image x that has to be assigned to one
of several defined classes, Ck . An example which we will explore shortly is a system that
takes an input image containing a text character, and is tasked with identifying the symbol
represented in the image. In problems where the input image x contains a large number of
pixels, eigen-expansions and PCA can be useful for reducing the dimension of the problem,
and in dealing with the common issue of a limited set of training data.
In a probabilistic framework, an image X belonging to class Ck is modeled with a probability distribution given by p(x|Ck ), where the distributions are generally different across
classes. Given an input image x, the classification can proceed by finding the class label, k*, that yields the greatest posterior probability, P(C_k | X = x):

    k* = argmax_k P(C_k | X = x)                                                    (27)
       = argmax_k p(x|C_k) P(C_k) / p(x)                                            (28)
       = argmax_k p(x|C_k) P(C_k)                                                   (29)

If we assume for the present that the prior probability, P(C_k), is uniform (all classes are equally likely), then this result corresponds to a maximum likelihood (ML) class estimate,

    k* = argmax_k p(x|C_k) .                                                        (30)
As an example, consider the case in which the images in each class are Gaussian distributed with a unique mean and covariance, p(x|C_k) ~ N(μ_k, R_k). In this case,

    k* = argmax_k  1/((2π)^{p/2} |R_k|^{1/2}) exp{ −(1/2)(x − μ_k)^t R_k^{−1} (x − μ_k) }        (31)

       = argmax_k  { −(1/2)(x − μ_k)^t R_k^{−1} (x − μ_k) − (1/2) log(|R_k|) − (p/2) log(2π) }   (32)

       = argmin_k  { (x − μ_k)^t R_k^{−1} (x − μ_k) + log(|R_k|) }                               (33)
Notice the first term in (33) represents a weighted distance of the image to the class mean.
In practice, the means μ_k and covariances R_k for each class would be estimated from a set of training images in which the class of each image is known. However, two issues arise with high-dimensional data. First, the estimate of R_k is often not invertible due to a limited amount of training data, and second, each covariance R_k is usually enormous in size. Both of these issues can be addressed by transforming the high-dimensional data, x ∈ ℝ^p, to a lower-dimensional vector, y ∈ ℝ^m,

    y = A^t x                                                                       (34)

where the columns of A span an m-dimensional subspace of ℝ^p. The means and covariances of each class then become

    E[Y | C_k] = E[A^t X | C_k] = A^t μ_k ≡ μ̃_k                                     (35)
    E[(Y − E[Y])(Y − E[Y])^t | C_k] = A^t R_k A ≡ R̃_k .                             (36)
Now, how does one choose the transformation, A? One simple choice is the first m eigenvectors of the global covariance matrix R, as estimated from the entire training data set (irrespective of class),

    R = E Λ E^t                                                                     (38)
    A = [e_1  e_2  . . .  e_m] .                                                    (39)
This approach is relatively straightforward, but it does not take into account how the distributions of each class are separated from one another after projecting onto the lower-dimensional subspace. Therefore, direct PCA may not be an optimal solution for the purpose of classification. Another approach is to define a measure of the spread of the distributions and find the transformation A that maximizes this measure. One such measure is the Fisher linear discriminant [1], but this is a bit beyond the scope of this lab. In the following exercise, we will only consider PCA for dimension reduction.
5.1
In this exercise, you will implement a classifier using the text character images from the last
section as a training set. In this context the classifier will accept an input image, assumed to
be of a lower-case English letter, and determine which of the 26 English letters it represents.
First you need to reduce the dimension of the training data using PCA.
1. Compute the eigenvectors for the covariance of the combined data set. (You already
did this in Section 4.) You are disregarding class here, so consider the covariance
around the global mean image, μ̂, as in (17) and (18).
2. Form the transformation matrix A in (39) using the first 10 eigenvectors (corresponding
to the 10 largest eigenvalues).
3. Transform each of the original training images in X to a lower-dimensional representation Y by first subtracting the global mean image, μ̂, then applying the transformation A^t. In effect, Y = A^t (X − μ̂).
4. Using the data vectors, Y_i, compute the class means and covariances for each of the 26 classes,

    μ̂_k = 1/|C_k| Σ_{i=1}^{|C_k|} Y_i^{(k)}                                          (40)

    R̂_k = 1/(|C_k| − 1) Σ_{i=1}^{|C_k|} (Y_i^{(k)} − μ̂_k)(Y_i^{(k)} − μ̂_k)^t        (41)

where we use the notation Y_i^{(k)} for the i-th training vector of class k, and |C_k| for the number of training vectors in class k.
Tip: You might find it easiest to use a structure array to store the mean and covariance matrices for all 26 classes. You can use the following to define the array:

    empty_cell = cell(26, 2);
    params = cell2struct(empty_cell, {'M', 'R'}, 2);

Then the mean vector and covariance matrix for class k can be saved in the variables params(k).M and params(k).R.
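As a rough sketch (not a required implementation), the class statistics in (40) and (41) might be computed as follows, assuming the reduced-dimension training vectors are the columns of Y and that a hypothetical vector labels gives the class index (1 through 26) of each column:

    empty_cell = cell(26, 2);
    params = cell2struct(empty_cell, {'M', 'R'}, 2);
    for k = 1:26
        Yk = Y(:, labels == k);                  % training vectors of class k
        nk = size(Yk, 2);                        % |C_k|
        params(k).M = mean(Yk, 2);               % class mean, equation (40)
        Zk = Yk - repmat(params(k).M, 1, nk);    % centered class data
        params(k).R = (Zk * Zk') / (nk - 1);     % class covariance, equation (41)
    end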
The lab web site provides a file test_data.zip, which contains an additional set of 26 character images (not part of the training set). Use each of these images to test the classifier previously described,

    k* = argmin_k { (y − μ̂_k)^t R̂_k^{−1} (y − μ̂_k) + log(|R̂_k|) } .                 (42)

You will first need to reduce the dimension of each input image using the same transformation as in Step 3 above. Note that you want to project onto the same subspace used for the training images, so you need to use exactly the same A and μ̂ that you computed in the training stage. Produce a 2-column table showing each input image that is mis-classified, and the character it was mapped to.
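A minimal Matlab sketch of applying the decision rule (42) to one test image x_test, assuming the same A and mu_hat from training and the params structure above (the names x_test, d, and k_star are illustrative):

    y = A' * (x_test - mu_hat);              % reduce dimension as in Step 3
    d = zeros(26, 1);                        % discriminant value for each class
    for k = 1:26
        v = y - params(k).M;
        d(k) = v' * inv(params(k).R) * v + log(det(params(k).R));   % rule (42)
    end
    [dmin, k_star] = min(d);                 % k_star is the selected class index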
Section 5 Report:
Submit a 2-column table showing for each mis-classified input image: (1) the input character,
and (2) the output from the classifier.
You should have observed that this classifier produces a number of errors in this exercise. A
probable reason for this is the limited number of training images available for estimating
the class-dependent covariance matrices, Rk . We might reduce the errors by using a more
constrained matrix, Bk , in place of Rk .
    k* = argmin_k { (y − μ̂_k)^t B_k^{−1} (y − μ̂_k) + log(|B_k|) }                   (43)
1. Let B_k = Λ_k, i.e., assume each class has a different diagonal covariance, where the elements of Λ_k are the diagonal elements of R̂_k.
2. Let B_k = R_wc, i.e., assume each class has the same covariance, where R_wc is defined as the average within-class covariance,

    R_wc = (1/K) Σ_{k=1}^{K} R̂_k .                                                  (44)

Here, K is the number of classes.
3. Let B_k = Λ, i.e., each class has the same diagonal covariance, where the elements of Λ are the diagonal elements of the matrix R_wc defined above.
4. Let B_k = I, i.e., each class has an identity covariance around a different mean, μ̂_k.
Note that in each of these cases we are still computing the difference between the input and each class mean μ̂_k, but each case uses a different scaling matrix, B_k. Re-run the previous classification test using each of the above modifications.
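As a rough sketch, the four constrained models could be formed from the estimated class covariances as follows (the field params(k).B is an illustrative name, not part of the lab):

    K = 26;
    Rwc = zeros(size(params(1).R));          % average within-class covariance, (44)
    for k = 1:K
        Rwc = Rwc + params(k).R / K;
    end
    for k = 1:K
        Bk = diag(diag(params(k).R));        % 1. class-dependent diagonal covariance
        % Bk = Rwc;                          % 2. common within-class covariance
        % Bk = diag(diag(Rwc));              % 3. common diagonal covariance
        % Bk = eye(size(Rwc));               % 4. identity covariance
        params(k).B = Bk;                    % use B_k in place of R_k in rule (43)
    end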
Section 5 Report:
For each modification, submit a 2-column table showing for each mis-classified input image:
(1) the input character, and (2) the output from the classifier. Also answer the following:
1. Which of the above classifiers worked the best in this experiment?
2. In constraining the covariance, what is the trade off between the accuracy of the data
model and the accuracy of the estimates?
References
[1] C. M. Bishop, Pattern Recognition and Machine Learning, 2006.