Assignment 1
2. Concentration [5 pts]
The instructors would like to know what percentage of the students like the Introduction to Machine Learning course. Let this unknown (but hopefully very close to 1) quantity be denoted by µ. To estimate µ, the instructors created an anonymous survey which contains this question:
Each student can only answer this question once, and we assume that the distribution of the answers is i.i.d.
(b) Let the above estimator be denoted by µ̂. How many students should the instructors ask if they want the estimated value µ̂ to be close enough to the unknown µ that
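Whatever the exact tolerance and confidence level the question specifies, bounds of this kind typically follow from Hoeffding's inequality. As a reference: if the answers $X_1, \ldots, X_n$ are i.i.d., take values in $[0, 1]$, and have mean $\mu$, and $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i$, then
\[
P\left(|\hat{\mu} - \mu| \ge \epsilon\right) \le 2 \exp\left(-2 n \epsilon^2\right),
\]
so requiring the right-hand side to be at most $\delta$ gives $n \ge \frac{\ln(2/\delta)}{2\epsilon^2}$.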
3. MAP of Multinomial Distribution [10 pts]
You have just received a loaded 6-sided die from your statistician friend. Unfortunately, he does not remember its exact probability distribution $p_1, p_2, \ldots, p_6$. He remembers, however, that he generated the vector $(p_1, p_2, \ldots, p_6)$ from the following Dirichlet distribution:
\[
P(p_1, p_2, \ldots, p_6) = \frac{\Gamma\!\left(\sum_{i=1}^{6} u_i\right)}{\prod_{i=1}^{6} \Gamma(u_i)} \, \prod_{i=1}^{6} p_i^{\,u_i - 1} \; \delta\!\left(\sum_{i=1}^{6} p_i - 1\right),
\]
where he chose $u_i = i$ for all $i = 1, \ldots, 6$. Here $\Gamma$ denotes the gamma function, and $\delta$ is the Dirac delta. To estimate the probabilities $p_1, p_2, \ldots, p_6$, you roll the die 1000 times and observe that side $i$ occurred $n_i$ times ($\sum_{i=1}^{6} n_i = 1000$).
(a) Prove that the Dirichlet distribution is a conjugate prior for the multinomial distribution.
(b) What is the posterior distribution of the side probabilities, $P(p_1, p_2, \ldots, p_6 \mid n_1, n_2, \ldots, n_6)$?
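For both parts, recall that the likelihood of observing counts $n_1, \ldots, n_6$ in $N = \sum_{i=1}^{6} n_i$ independent rolls is multinomial,
\[
P(n_1, \ldots, n_6 \mid p_1, \ldots, p_6) = \frac{N!}{n_1! \cdots n_6!} \prod_{i=1}^{6} p_i^{\,n_i},
\]
and part (a) amounts to checking that multiplying this likelihood by the Dirichlet prior yields a posterior in the same (Dirichlet) family.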
where $X_i = [X_i^{(1)}, \ldots, X_i^{(p)}]$.
b) To prevent overfitting, we want the weights to be small. To achieve this, instead of maximum conditional likelihood estimation (M(C)LE) for logistic regression:
\[
\max_{w_0, \ldots, w_d} \; \prod_{i=1}^{n} P(Y_i \mid X_i, w_0, \ldots, w_d),
\]
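A common way to keep the weights small is to maximize the conditional likelihood with an $\ell_2$ penalty on the weights, which is equivalent to MAP estimation under a zero-mean Gaussian prior. The objective below is a generic sketch of that idea (the precise penalty, and whether $w_0$ is regularized, may differ in the intended formulation):
\[
\max_{w_0, \ldots, w_d} \;\; \sum_{i=1}^{n} \ln P(Y_i \mid X_i, w_0, \ldots, w_d) \;-\; \lambda \sum_{j=1}^{d} w_j^2,
\]
where $\lambda > 0$ controls the strength of the regularization.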
c) Draw a set of training data with three labels and the decision boundary resulting from a multi-class logistic regression. (The boundary does not need to be quantitatively correct but should qualitatively depict how a typical boundary from multi-class logistic regression would look.)
Handout - http://www.cs.cmu.edu/~aarti/Class/10701_Spring14/assignments/homework1.tar
You have been provided with the following data files in the handout:
• train.data - Contains bag-of-words data for each training document. Each row of the file represents
the number of occurrences of a particular term in some document. The format of each row is (docId,
termId, Count).
• train.label - Contains a label for each document in the training data.
• test.data - Contains bag-of-words data for each testing document. The format of this file is the same
as that of the train.data file.
• test.label - Contains a label for each document in the testing data.
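The sketch below illustrates one way to read this sparse format without building a dense matrix. It assumes the rows are whitespace-separated integers in (docId, termId, count) order, as described above; the {docId: {termId: count}} dictionary layout is just one possible in-memory representation.

from collections import defaultdict

def read_sparse_counts(path):
    """Return {docId: {termId: count}} from a bag-of-words .data file."""
    docs = defaultdict(dict)
    with open(path) as f:
        for line in f:
            doc_id, term_id, count = map(int, line.split())
            docs[doc_id][term_id] = count
    return docs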
For this assignment, you need to write code to complete the following functions:
• logPrior(trainLabels) - Computes the log prior of the training data-set. (5 pts)
• logLikelihood(trainData, trainLabels) - Computes the log likelihood of the training data-set. (7 pts)
• naiveBayesClassify(trainData, trainLabels, testData) - Classifies the data using the Naive Bayes algorithm. (13 pts)
Implementation Notes
1. We compute log probabilities to prevent numerical underflow when multiplying many probabilities together. You may refer to this article on how to perform addition and multiplication in log space.
2. You may encounter words during classification that you have not seen during training, either for a particular class or overall. Your code should handle this case. Hint: Laplace smoothing.
3. Be memory efficient and please do not create a document-term matrix in your code. That would require upwards of 600 MB of memory.
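The following is a minimal sketch of how these functions could fit together, using the sparse {docId: {termId: count}} representation from the reader above and a {docId: label} mapping for the labels. The exact argument formats, and the extra vocabSize parameter used for Laplace smoothing, are assumptions; the stubs in the handout define the actual interface.

import math
from collections import defaultdict

def logPrior(trainLabels):
    """Log prior log P(class), estimated from class frequencies."""
    counts = defaultdict(int)
    for label in trainLabels:
        counts[label] += 1
    total = len(trainLabels)
    return {c: math.log(n / total) for c, n in counts.items()}

def logLikelihood(trainData, trainLabels, vocabSize):
    """Log P(term | class) with Laplace (add-one) smoothing.

    trainData: {docId: {termId: count}}, trainLabels: {docId: label}.
    Returns ({class: {termId: log prob}}, {class: log prob of an unseen term}).
    """
    termCounts = defaultdict(lambda: defaultdict(int))
    classTotals = defaultdict(int)
    for docId, terms in trainData.items():
        label = trainLabels[docId]
        for termId, count in terms.items():
            termCounts[label][termId] += count
            classTotals[label] += count
    logLik, unseen = {}, {}
    for label, counts in termCounts.items():
        denom = classTotals[label] + vocabSize
        logLik[label] = {t: math.log((n + 1) / denom) for t, n in counts.items()}
        unseen[label] = math.log(1 / denom)  # smoothed probability of an unseen term
    return logLik, unseen

def naiveBayesClassify(trainData, trainLabels, testData, vocabSize):
    """Return {docId: predicted label} using multinomial Naive Bayes in log space."""
    prior = logPrior(list(trainLabels.values()))
    logLik, unseen = logLikelihood(trainData, trainLabels, vocabSize)
    predictions = {}
    for docId, terms in testData.items():
        best, bestScore = None, -float("inf")
        for label in prior:
            score = prior[label]
            for termId, count in terms.items():
                score += count * logLik[label].get(termId, unseen[label])
            if score > bestScore:
                best, bestScore = label, score
        predictions[docId] = best
    return predictions

Because the counts stay in per-document dictionaries and all probabilities are combined in log space, this avoids both the dense document-term matrix and numerical underflow mentioned in the notes above.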
Due to popular demand, we are allowing the solution to be coded in any of three languages: Octave, Julia, and Python.
Julia is a popular new open-source language developed for numerical and scientific computing, as well as being effective for general-purpose programming. This is the first time this language is being supported in a CMU course.
Python is an extremely flexible language and is popular in industry and the data science community. Note, however, that powerful Python libraries will not be available to you for this assignment.
For Octave and Julia, a blank function interface has been provided for you (in the handout). However, for
Python, you will need to perform the I/O for the data files and ensure the results are written to the correct
output files.
Challenge Question
This question is not graded, but it is highly recommended that you try it. In the above question, we are using all the terms from the vocabulary to make a prediction. This leads to a lot of noisy features. Although it may seem counter-intuitive, classifiers built from a smaller vocabulary often perform better because they generalize better to unseen data. Noisy features that are not well represented can skew the estimated distribution of words, leading to classification errors. Therefore, classification can be improved by selecting a subset of highly informative words.
Write a program to select a subset of the words from the vocabulary provided to you, and then use this subset to run your Naive Bayes classification again. Measure the change in accuracy. TF-IDF and information-theoretic measures are good places to start looking.
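As one possible (ungraded) starting point, the sketch below ranks terms by their summed TF-IDF score over the training documents and keeps the top k. Both the scoring rule and the cutoff k are arbitrary choices to experiment with, and it assumes the same {docId: {termId: count}} representation used earlier.

import math
from collections import defaultdict

def select_vocabulary(trainData, k=5000):
    """Rank terms by summed TF-IDF over the training documents and keep the top k."""
    numDocs = len(trainData)
    docFreq = defaultdict(int)
    for terms in trainData.values():
        for termId in terms:
            docFreq[termId] += 1
    scores = defaultdict(float)
    for terms in trainData.values():
        total = sum(terms.values())
        for termId, count in terms.items():
            tf = count / total
            idf = math.log(numDocs / docFreq[termId])
            scores[termId] += tf * idf
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

# Filter both the training and test counts down to the selected terms,
# then rerun naiveBayesClassify on the reduced data and compare accuracies.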
Support Vector Machines (Jit)
1. SVM Matching [15 points]
Figure 1 (at the end of this problem) plots SVM decision boundaries resulting from using different kernels and/or different slack penalties. In Figure 1, there are two classes of training data, with labels $y_i \in \{-1, 1\}$, represented by circles and squares respectively. The SOLID circles and squares represent the support vectors.
Determine which plot in Figure 1 was generated by each of the following optimization problems. Explain
your reasoning for each choice.
1.
\[
\min_{w, b, \xi} \;\; \frac{1}{2}\, w \cdot w + C \sum_{i=1}^{n} \xi_i
\]
s.t. $\forall i = 1, \ldots, n$:
\[
\xi_i \ge 0, \qquad (w \cdot x_i + b)\, y_i - (1 - \xi_i) \ge 0,
\]
and C = 0.1.
2.
\[
\min_{w, b, \xi} \;\; \frac{1}{2}\, w \cdot w + C \sum_{i=1}^{n} \xi_i \qquad (1)
\]
s.t. $\forall i = 1, \ldots, n$:
\[
\xi_i \ge 0, \qquad (w \cdot x_i + b)\, y_i - (1 - \xi_i) \ge 0,
\]
and C = 1.
3.
\[
\max_{\alpha} \;\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad (2)
\]
s.t. $\sum_{i=1}^{n} \alpha_i y_i = 0$; $\quad \alpha_i \ge 0, \; \forall i = 1, \ldots, n$;
where $K(u, v) = u \cdot v + (u \cdot v)^2$.
4.
\[
\max_{\alpha} \;\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad (3)
\]
s.t. $\sum_{i=1}^{n} \alpha_i y_i = 0$; $\quad \alpha_i \ge 0, \; \forall i = 1, \ldots, n$;
where $K(u, v) = \exp\!\left(-\frac{\|u - v\|^2}{2}\right)$.
5.
\[
\max_{\alpha} \;\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad (4)
\]
s.t. $\sum_{i=1}^{n} \alpha_i y_i = 0$; $\quad \alpha_i \ge 0, \; \forall i = 1, \ldots, n$;
where $K(u, v) = \exp\!\left(-\|u - v\|^2\right)$.
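Although not required for the matching, one way to build intuition is to fit SVMs with these kernels and slack penalties on a small 2-D dataset and inspect the resulting models (the decision boundary can then be plotted by evaluating decision_function on a grid). The sketch below uses scikit-learn, which is an assumption here and not part of the graded coding question: a custom kernel callable handles $u \cdot v + (u \cdot v)^2$, a very large C approximates the hard-margin duals, and for the RBF kernels gamma=0.5 and gamma=1 correspond to $\exp(-\|u-v\|^2/2)$ and $\exp(-\|u-v\|^2)$.

import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D toy data; replace with any small labeled set.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.hstack([np.ones(20), -np.ones(20)])

def poly_kernel(U, V):
    # K(u, v) = u.v + (u.v)^2, evaluated for all pairs of rows.
    G = U @ V.T
    return G + G ** 2

models = {
    "linear, C=0.1":   SVC(kernel="linear", C=0.1),
    "linear, C=1":     SVC(kernel="linear", C=1.0),
    "u.v + (u.v)^2":   SVC(kernel=poly_kernel, C=1e6),   # large C ~ hard margin
    "RBF, gamma=0.5":  SVC(kernel="rbf", gamma=0.5, C=1e6),
    "RBF, gamma=1":    SVC(kernel="rbf", gamma=1.0, C=1e6),
}
for name, clf in models.items():
    clf.fit(X, y)
    print(name, "support vectors per class:", clf.n_support_)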
Figure 1: Induced Decision Boundaries