CS 229, Summer 2019 Problem Set #2 Solutions
Notes: (1) These questions require thought, but do not require long answers. Please be as
concise as possible. (2) If you have a question about this homework, we encourage you to post
your question on our Piazza forum, at http://piazza.com/stanford/summer2019/cs229. (3)
If you missed the first lecture or are unfamiliar with the collaboration or honor code policy,
please read the policy on the course website before starting work. (4) For the coding problems,
you may not use any libraries except those defined in the provided environment.yml file. In
particular, ML-specific libraries such as scikit-learn are not permitted. (5) To account for late
days, the due date is Monday, July 29 at 11:59 pm. If you submit after Monday, July 29 at
11:59 pm, you will begin consuming your late days. If you wish to submit on time, submit before
Monday, July 29 at 11:59 pm.
All students must submit an electronic PDF version of the written questions. We highly recommend typesetting your solutions via LaTeX. All students must also submit a zip file of their source code to Gradescope, which should be created using the make_zip.py script. You should
make sure to (1) restrict yourself to only using libraries included in the environment.yml file,
and (2) make sure your code runs without errors. Your submission may be evaluated by the
auto-grader using a private test set, or used for verifying the outputs reported in the writeup.
(a) [2 points] What is the most notable difference in training the logistic regression model on
datasets A and B?
Answer:
The training algorithm converges quickly on dataset A, but fails to converge on dataset B (at least not within a reasonable amount of time).
(b) [5 points] Investigate why the training procedure behaves unexpectedly on dataset B, but
not on A. Provide hard evidence (in the form of math, code, plots, etc.) to corroborate
your hypothesis for the misbehavior. Remember, you should address why your explanation
does not apply to A.
Hint: The issue is not a numerical rounding or over/underflow error.
Answer:
We can see from the plots (at the top of this page) that in dataset B the two classes are perfectly separated. Logistic regression can therefore keep increasing the likelihood by scaling θ up indefinitely, so θ grows without bound and the algorithm never converges. In dataset A the data points are mixed to some extent, so no such direction of unbounded improvement exists.
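One way to make this precise (a sketch, assuming labels $y^{(i)} \in \{-1, +1\}$ and the average logistic loss; the exact convention used in the starter code is an assumption here):
\[
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \log\left( 1 + e^{-y^{(i)} \theta^T x^{(i)}} \right).
\]
If the data are perfectly separable, there is some $\theta^*$ with $y^{(i)} \theta^{*T} x^{(i)} > 0$ for all $i$. Then for any $c > 1$ we have $J(c\,\theta^*) < J(\theta^*)$, and $J(c\,\theta^*) \to 0$ as $c \to \infty$, so the loss has no finite minimizer and gradient descent keeps increasing $\|\theta\|$ forever. On dataset A no separating $\theta^*$ exists, so the loss attains its minimum at a finite $\theta$ and the algorithm converges.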
(c) [5 points] For each of these possible modifications, state whether or not it would lead to
the provided training algorithm converging on datasets such as B. Justify your answers.
i. Using a different constant learning rate.
ii. Decreasing the learning rate over time (e.g. scaling the initial learning rate by $1/t^2$, where $t$ is the number of gradient descent iterations thus far).
iii. Linear scaling of the input features.
iv. Adding a regularization term $\|\theta\|_2^2$ to the loss function.
v. Adding zero-mean Gaussian noise to the training data or labels.
Answer:
i. The algorithm still won't converge on dataset B: the learning rate is not what causes the failure to converge, since the likelihood keeps improving as $\|\theta\|$ grows no matter what constant learning rate is used.
ii. The algorithm won't converge on dataset B. The reason is the same as in i: the learning rate only affects how quickly $\|\theta\|$ grows, not the fact that it grows without bound.
iii. The algorithm won't converge on dataset B. Linearly rescaling the input features leaves the data perfectly separable, and it is the perfect separability, not the feature scale, that causes the problem.
iv. The algorithm will converge on dataset B. The regularization term keeps $\|\theta\|$ from growing without bound.
v. The algorithm will (in general) converge on dataset B. Adding noise typically destroys the perfect separation of the two classes, so the likelihood again has a finite maximizer.
(d) [3 points] Are support vector machines vulnerable to datasets like B? Why or why not? Give an informal justification.
Answer:
No. SVMs seek (and maximize) the geometric margin rather than the functional margin. The geometric margin is invariant to rescaling of the parameters, so perfectly separable data do not push the parameters to grow without bound, and the maximum-margin solution is well defined.
(a) [5 points] Implement code for processing the spam messages into numpy arrays that can be fed into machine learning models. Do this by completing the get_words, create_dictionary, and transform_text functions within the provided src/spam/spam.py. Do note the corresponding comments for each function for instructions on what specific processing is required.
The provided code will then run your functions and save the resulting dictionary into spam_dictionary and a sample of the resulting training matrix into spam_sample_train_matrix.
In your writeup, report the vocabulary size after the pre-processing step. You do not need to include any other output for this subquestion.
Answer:
The size is 1758 when only lowercasing is performed. I also removed punctuation and numbers, after which the size is 1539.
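A minimal sketch of what these functions might look like (the exact specification lives in the starter code's comments; in particular, the minimum number of messages a word must appear in to enter the dictionary, set to 5 below, is an assumption):

import collections
import numpy as np

def get_words(message):
    # Normalize to lowercase and split on whitespace.
    return message.lower().split()

def create_dictionary(messages, min_count=5):
    # Map each word that appears in at least `min_count` messages
    # (an assumed threshold) to a unique integer index.
    counts = collections.Counter()
    for message in messages:
        counts.update(set(get_words(message)))
    words = [w for w, c in counts.items() if c >= min_count]
    return {word: idx for idx, word in enumerate(sorted(words))}

def transform_text(messages, word_dictionary):
    # Build a (num_messages, vocab_size) word-count matrix for the
    # multinomial event model.
    matrix = np.zeros((len(messages), len(word_dictionary)))
    for i, message in enumerate(messages):
        for word in get_words(message):
            if word in word_dictionary:
                matrix[i, word_dictionary[word]] += 1
    return matrix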
(b) [10 points] In this question you are going to implement a naive Bayes classifier for spam classification with the multinomial event model and Laplace smoothing (refer to the class notes on Naive Bayes for details on Laplace smoothing, in Section 2.3 of notes2.pdf).
Code your implementation by completing the fit_naive_bayes_model and predict_from_naive_bayes_model functions in src/spam/spam.py.
Now src/spam/spam.py should be able to train a Naive Bayes model, compute your prediction accuracy and then save your resulting predictions to spam_naive_bayes_predictions.
In your writeup, report the accuracy of the trained model on the test set.
Remark. If you implement naive Bayes the straightforward way, you will find that the computed $p(x|y) = \prod_i p(x_i|y)$ often equals zero. This is because $p(x|y)$, which is the product of many numbers less than one, is a very small number. The standard computer representation of real numbers cannot handle numbers that are too small, and instead rounds them off to zero. (This is called "underflow.") You'll have to find a way to compute Naive Bayes' predicted class labels without explicitly representing very small numbers such as $p(x|y)$. [Hint: Think about using logarithms.]
Answer:
The accuracy is 0.976702 with lowercasing only, and 0.983871 after also removing punctuation and numbers from the text in the preprocessing step.
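A sketch of one possible implementation, using the count-matrix representation from part (a) (the function names match the starter code, but the model's internal representation below is an assumption):

import numpy as np

def fit_naive_bayes_model(matrix, labels):
    """Multinomial naive Bayes with Laplace (add-one) smoothing.

    matrix: (n_messages, vocab_size) numpy word-count matrix
    labels: (n_messages,) numpy array of 0/1 labels (1 = spam)
    """
    vocab_size = matrix.shape[1]
    spam, ham = matrix[labels == 1], matrix[labels == 0]
    # Laplace smoothing: add 1 to every word count in each class.
    phi_spam = (spam.sum(axis=0) + 1) / (spam.sum() + vocab_size)
    phi_ham = (ham.sum(axis=0) + 1) / (ham.sum() + vocab_size)
    phi_y = labels.mean()
    return {'phi_spam': phi_spam, 'phi_ham': phi_ham, 'phi_y': phi_y}

def predict_from_naive_bayes_model(model, matrix):
    # Work in log space to avoid underflow: compare
    # log p(y=1) + sum_j x_j log p(word_j | y=1) against the y=0 score.
    log_spam = np.log(model['phi_y']) + matrix @ np.log(model['phi_spam'])
    log_ham = np.log(1 - model['phi_y']) + matrix @ np.log(model['phi_ham'])
    return (log_spam > log_ham).astype(int)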
1 Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.
(c) [5 points] Intuitively, some tokens may be particularly indicative of an SMS being in a particular class. We can try to get an informal sense of how indicative token $i$ is for the SPAM class by looking at:
\[
\log \frac{p(x_j = i \mid y = 1)}{p(x_j = i \mid y = 0)} = \log \left( \frac{P(\text{token } i \mid \text{email is SPAM})}{P(\text{token } i \mid \text{email is NOTSPAM})} \right).
\]
Complete the get_top_five_naive_bayes_words function within the provided code using the above formula in order to obtain the 5 most indicative tokens.
Report the top five words in your writeup.
Answer:
With lowercasing only, the five most indicative tokens are: claim, won, prize, tone, urgent!. After also removing punctuation and numbers, they are: claim, prize, won, tone, guaranteed.
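A short sketch of how the top five tokens might be selected, reusing the assumed model dictionary from the naive Bayes sketch above:

import numpy as np

def get_top_five_naive_bayes_words(model, dictionary):
    # Rank tokens by log p(token | spam) - log p(token | ham).
    scores = np.log(model['phi_spam']) - np.log(model['phi_ham'])
    index_to_word = {idx: word for word, idx in dictionary.items()}
    top_indices = np.argsort(scores)[::-1][:5]
    return [index_to_word[i] for i in top_indices]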
(d) [2 points] Support vector machines (SVMs) are an alternative machine learning model that we discussed in class. We have provided you an SVM implementation (using a radial basis function (RBF) kernel) within src/spam/svm.py (you should not need to modify that code).
One important part of training an SVM parameterized by an RBF kernel (a.k.a. Gaussian kernel) is choosing an appropriate kernel radius parameter.
Complete the compute_best_svm_radius function by writing code to compute the SVM radius which maximizes accuracy on the validation dataset. Report the best kernel radius you obtained in the writeup.
Answer:
The best radius is 0.1; the resulting accuracy is 0.971326.
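A sketch of one way the radius search might look; the helper train_and_predict_svm (passed in here as a function argument) and the grid of candidate radii are assumptions, not necessarily the interface of the provided src/spam/svm.py:

import numpy as np

def compute_best_svm_radius(train_matrix, train_labels,
                            val_matrix, val_labels,
                            radius_candidates, train_and_predict_svm):
    # Try each candidate radius, train on the training set, and keep
    # the radius with the highest validation accuracy.
    best_radius, best_accuracy = None, -1.0
    for radius in radius_candidates:
        predictions = train_and_predict_svm(
            train_matrix, train_labels, val_matrix, radius)
        accuracy = np.mean(predictions == val_labels)
        if accuracy > best_accuracy:
            best_radius, best_accuracy = radius, accuracy
    return best_radius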
[Hint: For part (e), the answer is that K is indeed a kernel. You still have to prove it, though.
(This one may be harder than the rest.) This result may also be useful for another part of the
problem.]
Answer:
(b) Not necessarily. While we know that $z^T K_1 z \geq 0$ and $z^T K_2 z \geq 0$ for all $z$, we do not know which of the two quadratic forms is larger, so $z^T (K_1 - K_2) z \geq 0$ does not necessarily hold.
(c) Yes. We know that $z^T K_1 z \geq 0$ and $a$ is a positive real number, so $z^T (a K_1) z = a\, z^T K_1 z \geq 0$.
(d) No. We know that $z^T K_1 z \geq 0$ and $a$ is a positive real number, so $z^T (-a K_1) z = -a\, z^T K_1 z \leq 0$; unless $K_1$ is identically zero, $-a K_1$ is not positive semidefinite.
(g) Yes. Since $K_3$ is a valid kernel, its kernel matrix on any finite set of points is positive semidefinite. Evaluating $K_3$ on the mapped points $\phi(x^{(1)}), \dots, \phi(x^{(m)})$ is just evaluating it on a particular set of inputs, so the resulting matrix is still positive semidefinite and $K(x, z) = K_3(\phi(x), \phi(z))$ is a valid kernel.
(h) Yes. $p(K_1(x, z))$ can be viewed as a sum of terms $a_k K_1(x, z)^k$ with non-negative coefficients. Each power $K_1(x, z)^k$ is a valid kernel because products of valid kernels are valid kernels; non-negative scalings are valid kernels (as in (c)), sums of valid kernels are valid kernels, and a non-negative constant term is a valid kernel. Hence the whole polynomial is a valid kernel.
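A brief sketch of the product fact used in (h) above (this is essentially the result hinted at for part (e)): if $K_1(x, z) = \phi^{(1)}(x)^T \phi^{(1)}(z)$ and $K_2(x, z) = \phi^{(2)}(x)^T \phi^{(2)}(z)$, then
\[
K_1(x, z)\, K_2(x, z) = \Big( \sum_i \phi^{(1)}_i(x)\, \phi^{(1)}_i(z) \Big) \Big( \sum_j \phi^{(2)}_j(x)\, \phi^{(2)}_j(z) \Big)
= \sum_{i, j} \big( \phi^{(1)}_i(x)\, \phi^{(2)}_j(x) \big) \big( \phi^{(1)}_i(z)\, \phi^{(2)}_j(z) \big),
\]
which is an inner product under the feature map $\psi_{ij}(x) = \phi^{(1)}_i(x)\, \phi^{(2)}_j(x)$, so the product of two valid kernels is a valid kernel. Applying this repeatedly shows that each power $K_1(x, z)^k$ is a valid kernel.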
(a) [3 points] Let K be a Mercer kernel corresponding to some very high-dimensional feature
mapping φ. Suppose φ is so high-dimensional (say, ∞-dimensional) that it’s infeasible to
ever represent φ(x) explicitly. Describe how you would apply the “kernel trick” to the
perceptron to make it work in the high-dimensional feature space φ, but without ever
explicitly computing φ(x).
[Note: You don't have to worry about the intercept term. If you like, think of $\phi$ as having the property that $\phi_0(x) = 1$ so that this is taken care of.] Your description should specify:
i. [1 points] How you will (implicitly) represent the high-dimensional parameter vector $\theta^{(i)}$, including how the initial value $\theta^{(0)} = 0$ is represented (note that $\theta^{(i)}$ is now a vector whose dimension is the same as that of the feature vectors $\phi(x)$);
ii. [1 points] How you will efficiently make a prediction on a new input $x^{(i+1)}$. I.e., how you will compute $h_{\theta^{(i)}}(x^{(i+1)}) = g\big(\theta^{(i)T} \phi(x^{(i+1)})\big)$, using your representation of $\theta^{(i)}$; and
iii. [1 points] How you will modify the update rule given above to perform an update to $\theta$ on a new training example $(x^{(i+1)}, y^{(i+1)})$; i.e., using the update rule corresponding to the feature mapping $\phi$:
\[
\theta^{(i+1)} := \theta^{(i)} + \alpha \big( y^{(i+1)} - h_{\theta^{(i)}}(x^{(i+1)}) \big)\, \phi(x^{(i+1)})
\]
Answer:
i. Represent $\theta^{(i)}$ implicitly as a linear combination of the feature vectors of the training examples seen so far, $\theta^{(i)} = \sum_{l=1}^{i} \beta_l\, \phi(x^{(l)})$, and store only the coefficients $\beta_1, \dots, \beta_i$ together with the corresponding inputs $x^{(l)}$. The initial value $\theta^{(0)} = 0$ is represented by an empty list of coefficients.
ii. $h_{\theta^{(i)}}(x^{(i+1)}) = g\big(\theta^{(i)T} \phi(x^{(i+1)})\big) = g\Big(\sum_{l=1}^{i} \beta_l\, K(x^{(l)}, x^{(i+1)})\Big)$, which requires only kernel evaluations and never computes $\phi(x)$ explicitly.
iii. The update rule $\theta^{(i+1)} := \theta^{(i)} + \alpha \big( y^{(i+1)} - h_{\theta^{(i)}}(x^{(i+1)}) \big)\, \phi(x^{(i+1)})$ simply appends one new coefficient $\beta_{i+1} = \alpha \big( y^{(i+1)} - h_{\theta^{(i)}}(x^{(i+1)}) \big)$ to the representation, where the prediction is computed as in ii.
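A minimal sketch of how this representation might be implemented (the function names initial_state, predict, and update_state follow the starter code referenced in part (b) below; the exact state layout, kernel signature, and the 0/1 threshold function are assumptions):

def initial_state():
    # theta^(0) = 0 is represented by an empty list of
    # (coefficient beta_l, training input x^(l)) pairs.
    return []

def sign(score):
    # g: threshold at zero (0/1 labels assumed here).
    return 1 if score >= 0 else 0

def predict(state, kernel, x_new):
    # h(x_new) = g( sum_l beta_l * K(x^(l), x_new) ),
    # computed with kernel evaluations only.
    score = sum(beta * kernel(x_l, x_new) for beta, x_l in state)
    return sign(score)

def update_state(state, kernel, learning_rate, x_new, y_new):
    # Append beta_{i+1} = alpha * (y^(i+1) - h(x^(i+1))).
    beta = learning_rate * (y_new - predict(state, kernel, x_new))
    state.append((beta, x_new))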
(b) [10 points] Implement your approach by completing the initial_state, predict, and update_state methods of src/perceptron/perceptron.py.
We provide two kernels, a dot-product kernel and a radial basis function (RBF) kernel. Run
src/perceptron/perceptron.py to train kernelized perceptrons on src/perceptron/train.csv.
The code will then test the perceptron on src/perceptron/test.csv and save the resulting
predictions in the src/perceptron/ folder. Plots will also be saved in src/perceptron/.
Include the two plots (corresponding to each of the kernels) in your writeup, and indicate
which plot belongs to which kernel.
Answer:
Figure 2: dot-product kernel
Figure 3: RBF kernel
(c) [2 points]
One of the provided kernels performs extremely poorly in classifying the points. Which
kernel performs badly and why does it fail?
Answer:
The dot-product kernel performs badly because in this case the decision boundary is a curve, so the data are not linearly separable in the original input space; a perceptron with the dot-product kernel can only produce a linear decision boundary, whereas the RBF kernel can represent the required nonlinear boundary.
The data and starter code for this problem can be found in
• src/mnist/nn.py
• src/mnist/images_train.csv
• src/mnist/labels_train.csv
• src/mnist/images_test.csv
• src/mnist/labels_test.csv
The starter code splits the set of 60,000 training images and labels into a set of 50,000 examples
as the training set, and 10,000 examples as the dev set.
To start, you will implement a neural network with a single hidden layer and cross entropy loss,
and train it with the provided data set. Use the sigmoid function as activation for the hidden
layer, and softmax function for the output layer. Recall that for a single example (x, y), the
cross entropy loss is:
\[
\mathrm{CE}(y, \hat{y}) = - \sum_{k=1}^{K} y_k \log \hat{y}_k,
\]
where $\hat{y} \in \mathbb{R}^K$ is the vector of softmax outputs from the model for the training example $x$, and $y \in \mathbb{R}^K$ is the ground-truth vector for the training example $x$ such that $y = [0, \dots, 0, 1, 0, \dots, 0]^\top$ contains a single 1 at the position of the correct class (also called a "one-hot" representation).
For $n$ training examples, we average the cross entropy loss over the $n$ examples:
\[
J(W^{[1]}, W^{[2]}, b^{[1]}, b^{[2]}) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{CE}(y^{(i)}, \hat{y}^{(i)}) = - \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_k^{(i)} \log \hat{y}_k^{(i)}.
\]
The starter code already converts labels into one hot representations for you.
Instead of batch gradient descent or stochastic gradient descent, the common practice is to use
mini-batch gradient descent for deep learning tasks. In this case, the cost function is defined as
follows:
\[
J_{MB} = \frac{1}{B} \sum_{i=1}^{B} \mathrm{CE}(y^{(i)}, \hat{y}^{(i)})
\]
where $B$ is the batch size, i.e. the number of training examples in each mini-batch.
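For concreteness, here is a small numpy sketch of the forward pass and the mini-batch cross-entropy loss described above (a sketch, not the provided starter code; the parameter names W1, b1, W2, b2 are assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the row-wise max for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(X, W1, b1, W2, b2):
    """Single hidden layer: sigmoid activation, softmax output.

    X: (B, input_dim) mini-batch of examples.
    Returns the hidden activations and the softmax outputs.
    """
    hidden = sigmoid(X @ W1 + b1)        # (B, hidden_dim)
    probs = softmax(hidden @ W2 + b2)    # (B, K)
    return hidden, probs

def minibatch_cross_entropy(probs, Y_one_hot):
    # J_MB = -(1/B) * sum_i sum_k y_k^(i) log yhat_k^(i)
    B = Y_one_hot.shape[0]
    return -np.sum(Y_one_hot * np.log(probs)) / B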
(b) [7 points] Now add a regularization term to your cross entropy loss. The loss function will become
\[
J_{MB} = \left( \frac{1}{B} \sum_{i=1}^{B} \mathrm{CE}(y^{(i)}, \hat{y}^{(i)}) \right) + \lambda \left( \|W^{[1]}\|^2 + \|W^{[2]}\|^2 \right)
\]
Be careful not to regularize the bias/intercept term. Set λ to be 0.0001. Implement the regularized version and plot the same figures as part (a). Be careful NOT to include the regularization term in the loss value used for plotting (i.e., regularization should only be used in the gradient calculation for the purpose of training).
Submit the two new plots obtained with regularized training (i.e. loss (without regularization term) vs epoch, and accuracy vs epoch) in your writeup.
Compare the plots obtained from the regularized model with the plots obtained from the
non-regularized model, and summarize your observations in a couple of sentences.
As in the previous part, save the learnt parameters (weights and biases) into a different file
so that we can initialize from them next time.
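One way the regularization term might enter the gradient computation (a sketch, assuming grad_W1 and grad_W2 already hold the gradients of the unregularized mini-batch loss; only the weight matrices, not the biases, receive the extra term, and the loss reported for plotting remains the plain cross-entropy):

def apply_l2_regularization(grad_W1, grad_W2, W1, W2, reg_lambda=0.0001):
    # d/dW of lambda * (||W1||^2 + ||W2||^2) is 2 * lambda * W,
    # so add that to the weight gradients only (biases untouched).
    grad_W1 = grad_W1 + 2 * reg_lambda * W1
    grad_W2 = grad_W2 + 2 * reg_lambda * W2
    return grad_W1, grad_W2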
Answer:
With regularization, the training accuracy is similar to the accuracy without regularization, while the dev accuracy is slightly higher than without regularization, especially after the first few epochs.
(c) [3 points] All this while you should have stayed away from the test data completely. Now
that you have convinced yourself that the model is working as expected (i.e., the observations you made in the previous part match what you learnt in class about regularization), it is
finally time to measure the model performance on the test set. Once we measure the test
set performance, we report it (whatever value it may be), and NOT go back and refine the
model any further.
Initialize your model from the parameters saved in part (a) (i.e., the non-regularized model), and evaluate the model performance on the test data. Repeat this using the parameters saved in part (b) (i.e., the regularized model).
Report your test accuracy for both regularized model and non-regularized model.
Answer: Regularized model: 0.9320; non-regularized model: 0.9287.
Compare this to the maximum likelihood estimation (MLE) we have seen previously:
\[
\theta_{\mathrm{MLE}} = \arg\max_\theta p(y \mid x, \theta).
\]
In this problem, we explore the connection between MAP estimation, and common regularization
techniques that are applied with MLE estimation. In particular, you will show how the choice
of prior distribution over θ (e.g., Gaussian or Laplace prior) is equivalent to different kinds of
regularization (e.g., L2 , or L1 regularization). To show this, we shall proceed step by step,
showing intermediate steps.
(a) [3 points] Show that θMAP = argmaxθ p(y|x, θ)p(θ) if we assume that p(θ) = p(θ|x). The
assumption that p(θ) = p(θ|x) will be valid for models such as linear regression where the
input x are not explicitly modeled by θ. (Note that this means x and θ are marginally
independent, but not conditionally independent when y is given.)
Answer:
$\theta_{\mathrm{MAP}} = \arg\max_\theta p(\theta \mid x, y) = \arg\max_\theta \frac{p(x, y, \theta)}{p(x, y)} = \arg\max_\theta p(x, y, \theta)$, since $p(x, y)$ does not depend on $\theta$.
On the other side, $\arg\max_\theta p(y \mid x, \theta)\, p(\theta) = \arg\max_\theta \frac{p(x, y, \theta)}{p(x, \theta)}\, p(\theta) = \arg\max_\theta \frac{p(x, y, \theta)}{p(x)\, p(\theta)}\, p(\theta) = \arg\max_\theta p(x, y, \theta)$, where we used the assumption $p(\theta) = p(\theta \mid x)$, i.e. $p(x, \theta) = p(x)\, p(\theta)$, and the fact that $p(x)$ does not depend on $\theta$.
Thus the two are equivalent.
(b) [5 points] Recall that L2 regularization penalizes the L2 norm of the parameters while
minimizing the loss (i.e., negative log likelihood in case of probabilistic models). Now we
will show that MAP estimation with a zero-mean Gaussian prior over θ, specifically θ ∼
N (0, η 2 I), is equivalent to applying L2 regularization with MLE estimation. Specifically,
show that
\[
\theta_{\mathrm{MAP}} = \arg\min_\theta \; - \log p(y \mid x, \theta) + \lambda \|\theta\|_2^2.
\]
(c) [7 points] Now consider a specific instance, a linear regression model given by $y = \theta^T x + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Assume that the random noise $\epsilon^{(i)}$ is independent for every training example $x^{(i)}$. Like before, assume a Gaussian prior on this model such that $\theta \sim \mathcal{N}(0, \eta^2 I)$.
For notation, let $X$ be the design matrix of all the training example inputs where each row vector is one example input, and $\vec{y}$ be the column vector of all the example outputs.
Come up with a closed form expression for $\theta_{\mathrm{MAP}}$.
Answer:
The prior and likelihood satisfy
\[
p(\theta) \propto \exp\left( - \frac{\|\theta\|_2^2}{2\eta^2} \right),
\qquad
p(\vec{y} \mid X, \theta) \propto \exp\left( - \frac{\|\vec{y} - X\theta\|_2^2}{2\sigma^2} \right),
\]
so
\[
\theta_{\mathrm{MAP}} = \arg\min_\theta \left[ - \log p(\vec{y} \mid X, \theta)\, p(\theta) \right]
= \arg\min_\theta \left[ \frac{1}{2\sigma^2} \|\vec{y} - X\theta\|_2^2 + \frac{1}{2\eta^2} \|\theta\|_2^2 \right].
\]
Setting the gradient with respect to $\theta$ to zero, $\frac{1}{\sigma^2} X^T (X\theta - \vec{y}) + \frac{1}{\eta^2}\theta = 0$, gives the closed form
\[
\theta_{\mathrm{MAP}} = \left( X^T X + \frac{\sigma^2}{\eta^2} I \right)^{-1} X^T \vec{y}.
\]
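A quick numerical sanity check of this closed form (a sketch on synthetic data; the dimensions and noise scales below are arbitrary assumptions):

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
sigma, eta = 0.5, 1.0

# Synthetic data: theta_true drawn from the prior, Gaussian noise on y.
theta_true = rng.normal(0, eta, size=d)
X = rng.normal(size=(n, d))
y = X @ theta_true + rng.normal(0, sigma, size=n)

# Closed-form MAP estimate: (X^T X + (sigma^2/eta^2) I)^{-1} X^T y.
theta_map = np.linalg.solve(X.T @ X + (sigma**2 / eta**2) * np.eye(d),
                            X.T @ y)
print(theta_map)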
(d) [5 points] Next, consider the Laplace distribution, whose density is given by
\[
f_{\mathcal{L}}(z \mid \mu, b) = \frac{1}{2b} \exp\left( - \frac{|z - \mu|}{b} \right).
\]
Answer:
With a Laplace prior, $p(\theta) \propto \exp\left( - \|\theta\|_1 / b \right)$, so by the same argument as in part (c),
\[
\theta_{\mathrm{MAP}} = \arg\min_\theta \left[ - \log \left( \exp\left( - \frac{\|\vec{y} - X\theta\|_2^2}{2\sigma^2} \right) \exp\left( - \frac{\|\theta\|_1}{b} \right) \right) \right]
= \arg\min_\theta \left[ \frac{1}{2\sigma^2} \|\vec{y} - X\theta\|_2^2 + \frac{1}{b} \|\theta\|_1 \right],
\]
which is equivalent to minimizing
\[
J(\theta) = \|X\theta - \vec{y}\|_2^2 + \gamma \|\theta\|_1
\qquad \text{with} \qquad
\gamma = \frac{2\sigma^2}{b}.
\]
Remark: Linear regression with L2 regularization is also commonly called Ridge regression, and when L1 regularization is employed, it is commonly called Lasso regression. These regularizations can be applied to any Generalized Linear Model just as above (by replacing log p(y|x, θ) with the appropriate family likelihood). Regularization techniques of the above type are also called
weight decay, and shrinkage. The Gaussian and Laplace priors encourage the parameter values
to be closer to their mean (i.e., zero), which results in the shrinkage effect.
Remark: Lasso regression (i.e., L1 regularization) is known to result in sparse parameters,
where most of the parameter values are zero, with only some of them non-zero.