CS5785 Homework 4: .PDF .Py .Ipynb
The homework is generally split into programming exercises and written exercises.
This homework is due on Dec 3rd, 2020 at 11:59 PM ET. Upload your homework to Gradescope
(Canvas->Gradescope). There are two assignments for this homework in Gradescope. Please note a
complete submission should include:
1. A write-up as a single .pdf file. −→ Submit to “Homework 4- Write Up”.
2. Source code for all of your experiments (AND figures) zipped into a single .zip file, in .py files
if you use Python or .ipynb files if you use the IPython Notebook. If you use some other language, include all build scripts necessary to build and run your project along with instructions
on how to compile and run your code. If you use the IPython Notebook to create any graphs,
please make sure you also include them. −→ Submit to “Homework 4- Code”.
The write-up should contain a general summary of what you did, how well your solution works, any
insights you found, etc. On the cover page, include the class name, homework number, and team
member names. You are responsible for submitting clear, organized answers to the questions. You
could use online LaTeX templates from Overleaf, under “Homework Assignment” or “Project / Lab
Report”.
Please include all relevant information for a question, including text response, equations, figures,
graphs, output, etc. If you include graphs, be sure to include the source code that generated them.
Please pay attention to Canvas for relevant information regarding updates, tips, and policy changes.
You are encouraged (but not required) to work in groups of 2.
PROGRAMMING EXERCISES
You will need to download extra data files for the following problems. You can find them on Canvas
under Files->Homework_data.
1. Clustering for text analysis. (20 pts) In this problem, you will analyze all the articles from the journal Science in the year 2000. (Thanks to JSTOR for providing the data.) Many of the parameters of
this analysis will be left for you to decide. For this problem you will need science2k-vocab.npy
and science2k-titles.npy, which are vectors of terms and titles, respectively.
(a) (10 pts) The extra data file science2k-doc-word.npy contains a 1373 × 5476 matrix, where
each row is an article in Science described by 5476 word features. The articles and words are
in the same order as in the vocabulary and titles files above. You can read this file using
numpy.load("science2k-doc-word.npy")
To obtain the features, we performed the following transformation. First, we computed per-
document smoothed word frequencies. Second, we took the log of those frequencies. Finally,
we centered the per-document log frequencies to have zero mean.
Cluster the documents using k-means and various values of k (go up to at least k = 20).
Select a value of k.
For that value, report the top 10 words of each cluster in order of the largest positive distance
from the average value across all data. More specifically, if x is the 5476-vector of average
values across documents and m_i is the i-th cluster mean, report the words associated with the top
components in m_i − x. Report the top ten documents that fall closest to each cluster center.
You can find the titles in the science2k-titles.txt file.
Comment on these results. What has the algorithm captured? How might such an algorithm
be useful?
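One possible workflow for part (a), as a minimal sketch. It assumes scikit-learn is available and the .npy files are in the working directory; k = 10 is an arbitrary placeholder, and the allow_pickle flag may or may not be needed depending on how the string arrays were saved:

import numpy as np
from sklearn.cluster import KMeans

docs = np.load("science2k-doc-word.npy")                    # 1373 x 5476
vocab = np.load("science2k-vocab.npy", allow_pickle=True)   # 5476 terms
titles = np.load("science2k-titles.npy", allow_pickle=True) # 1373 titles

k = 10  # placeholder; select your own k after trying several values
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(docs)

x_bar = docs.mean(axis=0)  # average feature vector across all documents
for i in range(k):
    diff = km.cluster_centers_[i] - x_bar
    print("cluster", i, "top words:", vocab[np.argsort(diff)[::-1][:10]])
    dists = np.linalg.norm(docs - km.cluster_centers_[i], axis=1)
    print("closest documents:", titles[np.argsort(dists)[:10]])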
(b) (10 pts) The file science2k-word-doc.npy is similar, but captures term-wise rather than document-wise features. That is, for each term we count the number of documents that the term appears in, rather than the other way around. This allows us to characterize individual terms.
This matrix is 5476 × 1373, where each row is a term in Science described by 1373 “document”
features. These are transformed document frequencies (as above). Repeat the analysis above,
but cluster terms instead of documents. The terms are listed in science2k-vocab.txt.
Comment on these results. How might such an algorithm be useful? What is different about
clustering terms from clustering documents?
2. EM algorithm and implementation (25 pts)
(a) (5 pts) Download the Old Faithful Geyser Dataset. The data file contains 272 observations of
(eruption time, waiting time). Treat each entry as a 2-dimensional feature vector. Parse the file and
plot all data points on the 2-D plane.
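A minimal parsing-and-plotting sketch, assuming a whitespace-delimited file with columns (index, eruption time, waiting time); the file name and number of header lines vary by source, so adjust accordingly:

import numpy as np
import matplotlib.pyplot as plt

# Adjust skiprows/usecols to match the actual file you downloaded.
X = np.loadtxt("faithful.dat", skiprows=26, usecols=(1, 2))

plt.scatter(X[:, 0], X[:, 1], s=12)
plt.xlabel("Eruption time (min)")
plt.ylabel("Waiting time (min)")
plt.show()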
(b) (10 pts) Implement a bimodal GMM to fit all the data points using the EM algorithm. Explain
the reasoning behind your termination criteria. For this problem, we assume the covariance
matrix is spherical (i.e., it has the form σ²I for a scalar σ) and you can randomly initialize the
Gaussian parameters. For evaluation purposes, please submit the following figures:
• Plot the trajectories of the two mean vectors in 2 dimensions (i.e., coordinates vs. iteration).
• Run your program 50 times with different initial parameter guesses. Show the distribution of the total number of iterations needed for the algorithm to converge.
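A minimal sketch of one way to implement the spherical-covariance EM updates, terminating when the log-likelihood improvement falls below a tolerance; the function name and interface are our own, and X is the 272 × 2 array from part (a):

import numpy as np

def fit_spherical_gmm(X, max_iter=500, tol=1e-6, seed=None):
    # EM for a 2-component GMM with spherical covariances sigma_k^2 * I.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=2, replace=False)].astype(float)  # random init
    var = np.ones(2)
    w = np.array([0.5, 0.5])
    prev_ll = -np.inf
    for it in range(1, max_iter + 1):
        # E-step: log N(x | mu_k, var_k I) + log w_k for each component.
        sq = np.stack([((X - mu[k]) ** 2).sum(axis=1) for k in range(2)], axis=1)
        log_p = -0.5 * (sq / var + d * np.log(2 * np.pi * var)) + np.log(w)
        log_norm = np.logaddexp(log_p[:, [0]], log_p[:, [1]])
        ll = log_norm.sum()
        r = np.exp(log_p - log_norm)  # responsibilities, shape (n, 2)
        # M-step: re-estimate weights, means, and spherical variances.
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = np.array([(r[:, k] * ((X - mu[k]) ** 2).sum(axis=1)).sum()
                        for k in range(2)]) / (d * nk)
        # Terminate once the log-likelihood stops improving appreciably.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return mu, var, w, it

To produce the trajectory figure, record a copy of mu at every iteration; for the second figure, call the function 50 times with different seeds and plot a histogram of the returned iteration counts.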
(c) (10 pts) Repeat the task in (b) but with the initial guesses of the parameters generated from
the following process:
• Run a k-means algorithm over all the data points with K = 2 and label each point with
one of the two clusters.
• Estimate the first guess of the mean and covariance matrices using maximum likelihood
over the labeled data points.
Compare the performance of the algorithms in (b) and (c).
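A sketch of generating the initial guesses from k-means labels, assuming scikit-learn; under the fit_spherical_gmm interface sketched in part (b), you would pass mu0, var0, and w0 in as the initial parameters instead of using the random initialization:

import numpy as np
from sklearn.cluster import KMeans

labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
mu0 = np.array([X[labels == k].mean(axis=0) for k in range(2)])
# Spherical maximum-likelihood variance within each labeled cluster.
var0 = np.array([((X[labels == k] - mu0[k]) ** 2).mean() for k in range(2)])
w0 = np.array([(labels == k).mean() for k in range(2)])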
3. Eigenface for face recognition (40 pts).
In this assignment you will implement the Eigenface method for recognizing human faces. You will
use face images from The Yale Face Database B, with 64 images under different lighting
conditions for each of 10 distinct subjects, 640 face images in total. With your implementation,
you will explore the power of the Singular Value Decomposition (SVD) in representing face images.
Read more (optional):
• Eigenface on Wikipedia: https://en.wikipedia.org/wiki/Eigenface
• Eigenface on Scholarpedia: http://www.scholarpedia.org/article/Eigenfaces
(a) (2 pts) Download and unzip The Face Dataset (faces.zip). You will find a folder called images, which contains all the training and test images; train.txt and test.txt specify the training set and test (validation) set split, respectively; each line gives an image path and the corresponding label.
(b) (2 pts) Load the training set into a matrix X: there are 540 training images in total, each with
50 × 50 pixels that need to be flattened into a 2500-dimensional vector. So the size of X
should be 540 × 2500, where each row is a flattened face image. Pick a face image from X and
display it in grayscale. Do the same thing for the test set. The size of the matrix Xtest for
the test set should be 100 × 2500.
Below is the start of the sample code for loading data from the training set; you can run it directly in a Jupyter Notebook. (Note: the image I/O in scipy.misc has been removed from recent SciPy releases; a complete, updated loading sketch follows below.)

import numpy as np
from scipy import misc
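The snippet above is truncated by a page break in the original handout. Below is a complete loading sketch, assuming each line of train.txt and test.txt has the form "<image path> <label>" with paths relative to the working directory; imageio stands in for the removed scipy.misc image reader:

import numpy as np
import imageio.v2 as imageio

def load_split(list_file):
    # Returns (N x 2500 image matrix, length-N label vector).
    paths, labels = [], []
    with open(list_file) as f:
        for line in f:
            path, label = line.split()
            paths.append(path)
            labels.append(int(label))
    # Flatten each 50 x 50 grayscale image into a 2500-dimensional row.
    X = np.stack([imageio.imread(p).reshape(-1) for p in paths]).astype(float)
    return X, np.array(labels)

X, y = load_split("train.txt")          # X: 540 x 2500
Xtest, ytest = load_split("test.txt")   # Xtest: 100 x 2500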
(c) (3 pts) Average Face. Compute the average face µ from the whole training set by summing up
every row in X then dividing by the number of faces. Display the average face as a grayscale
image.
(d) (3 pts) Mean Subtraction. Subtract the average face µ from every row in X. That is, x_i := x_i − µ, where x_i is the i-th row of X. Pick a face image after mean subtraction from the new X
and display that image in grayscale. Do the same thing for the test set Xtest using the pre-computed average face µ in (c).
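Parts (c) and (d) reduce to a few lines, reusing X and Xtest from the loading sketch above:

import matplotlib.pyplot as plt

mu = X.mean(axis=0)      # (c) average face, a 2500-dimensional vector
X = X - mu               # (d) mean-subtract the training faces
Xtest = Xtest - mu       # use the training-set mean for the test set as well

plt.imshow(mu.reshape(50, 50), cmap="gray")  # display the average face
plt.show()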
(e) (10 pts) Eigenface. Perform an eigendecomposition X^T X = VΛV^T to get the eigenvectors V^T,
where each row of V^T has the same dimension as a face image. We refer to v_i, the i-th
row of V^T, as the i-th eigenface. Display the first 10 eigenfaces as 10 images in grayscale.
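One way to compute the eigenfaces, as a sketch: since X^T X is symmetric, np.linalg.eigh applies (it returns eigenvalues in ascending order, so the eigenvectors must be re-sorted); taking the SVD of X would yield the same vectors:

import numpy as np
import matplotlib.pyplot as plt

evals, V = np.linalg.eigh(X.T @ X)    # eigendecomposition of the 2500 x 2500 matrix
VT = V[:, np.argsort(evals)[::-1]].T  # row i of VT is the i-th eigenface

fig, axes = plt.subplots(1, 10, figsize=(20, 2))
for i, ax in enumerate(axes):
    ax.imshow(VT[i].reshape(50, 50), cmap="gray")
    ax.axis("off")
plt.show()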
(f) (10 pts) Eigenface Feature. The top r eigenfaces V^T[: r, :] = {v_1, v_2, . . . , v_r}^T span an r-dimensional
linear subspace of the original image space called the face space, whose origin is the average face
µ and whose axes are the eigenfaces {v_1, v_2, . . . , v_r}. Therefore, using the top r eigenfaces
{v_1, v_2, . . . , v_r}, we can represent a 2500-dimensional face image z as an r-dimensional feature vector f: f = V^T[: r, :] z = [v_1, v_2, . . . , v_r]^T z. Write a function to generate the r-dimensional
feature matrices F and Ftest for the training images X and test images Xtest, respectively (to get F,
multiply X by the transpose of the first r rows of V^T; F should have the same number of rows as X and
r columns; similarly for Xtest).
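The feature extraction itself is a single matrix product; a sketch:

def eigenface_features(A, VT, r):
    # Project (mean-subtracted) images A onto the top-r eigenfaces: A is N x 2500,
    # VT[:r] is r x 2500, so the result is an N x r feature matrix.
    return A @ VT[:r].T

F = eigenface_features(X, VT, 10)
Ftest = eigenface_features(Xtest, VT, 10)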
(g) (10 pts) Face Recognition. Extract training and test features for r = 10. Train a Logistic Regression model using F and test on Ftest. Report the classification accuracy on the test set.
Plot the classification accuracy on the test set as a function of r for r = 1, 2, . . . , 200. Use
“one-vs-rest” logistic regression, where a classifier is trained for each possible output label.
Each classifier is trained on faces with that label as positive data and all faces with other labels
as negative data. sklearn calls this “ovr” mode.
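A sketch of the recognition experiment with scikit-learn; note that older scikit-learn versions select one-vs-rest with multi_class="ovr", while newer versions deprecate that argument in favor of wrapping the classifier in OneVsRestClassifier:

from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

accs = []
for r in range(1, 201):
    clf = LogisticRegression(multi_class="ovr", max_iter=1000)
    clf.fit(eigenface_features(X, VT, r), y)
    accs.append(clf.score(eigenface_features(Xtest, VT, r), ytest))

plt.plot(range(1, 201), accs)
plt.xlabel("r (number of eigenfaces)")
plt.ylabel("Test accuracy")
plt.show()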
WRITTEN EXERCISES
1. SVD and eigendecomposition. (5 pts) Show that from the Singular Value Decomposition (SVD) of
a matrix X, we can obtain the eigendecomposition of X^T X. This tells us that we can do an SVD of
X and get the same result as the eigendecomposition of X^T X, but the SVD is faster and easier.
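As a hint on the direction, recall that an SVD $X = U\Sigma V^T$ has orthonormal $U$ and $V$, so
$$X^T X = (U\Sigma V^T)^T (U\Sigma V^T) = V\,\Sigma^T \Sigma\,V^T;$$
it remains for you to argue why this is an eigendecomposition of $X^T X$ and what its eigenvalues are.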
2. Weights for clustering (ESL Exercise 14.1). (5 pts) In clustering algorithms like K-means, we need
to compute distances in the feature space. Sometimes people use weights to value some features
more than others. Show that the weighted Euclidean distance between p-dimensional data points $x_i$ and $x_{i'}$,
$$d_e^{(w)}(x_i, x_{i'}) = \frac{\sum_{l=1}^{p} w_l (x_{il} - x_{i'l})^2}{\sum_{l=1}^{p} w_l}, \tag{1}$$
satisfies
$$d_e^{(w)}(x_i, x_{i'}) = d_e(z_i, z_{i'}) = \sum_{l=1}^{p} (z_{il} - z_{i'l})^2, \tag{2}$$
where
$$z_{il} = x_{il} \cdot \left( \frac{w_l}{\sum_{l=1}^{p} w_l} \right)^{1/2}. \tag{3}$$
Thus weighted Euclidean distance based on x is equivalent to unweighted Euclidean distance
based on properly transformed data z.
3. SVD of Rank-Deficient Matrix. (10 pts) You can use the NumPy library for this problem. In this
extreme example of dimensionality reduction, we are going to see that the original data's dimensionality
is redundant and can be completely captured by lower dimensions. Consider the matrix M. It has rank
2, as you can see by observing that three times the first column minus the other two columns is 0:
$$M = \begin{pmatrix} 1 & 0 & 3 \\ 3 & 7 & 2 \\ 2 & -2 & 8 \\ 0 & -1 & 1 \\ 5 & 8 & 7 \end{pmatrix}. \tag{4}$$
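A quick numerical sanity check of the rank-2 claim with NumPy:

import numpy as np

M = np.array([[1, 0, 3],
              [3, 7, 2],
              [2, -2, 8],
              [0, -1, 1],
              [5, 8, 7]], dtype=float)

print(np.linalg.svd(M, compute_uv=False))  # third singular value is ~0
print(np.linalg.matrix_rank(M))            # 2
print(3 * M[:, 0] - M[:, 1] - M[:, 2])     # the linear dependence: all zeros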