Homework 1
Due Date: October 25, 2020 (11:59pm)
Instructions:
• Only electronic submissions will be accepted. Your main PDF writeup must be typeset in LaTeX (please
also refer to the “Additional Instructions” below).
• The PDF writeup containing your solution has to be submitted via Gradescope https://www.gradescope.com/.
• We have created your Gradescope account (you should have received the notification). Please use your
IITK CC ID (not any other email ID) to login. Use the “Forgot Password” option to set your password.
Additional Instructions
• We have provided a LaTeX template file hw1sol.tex to help typeset your PDF writeup. There is also
a style file ml.sty that contains shortcuts for many useful LaTeX commands, e.g., boldfaced/calligraphic
fonts for letters, various mathematical/Greek symbols, etc. Use of these shortcuts is recommended (but
not necessary).
• Your answer to every question should begin on a new page. The provided template is designed to do this
automatically. However, if it fails to do so, use the \clearpage command in LaTeX before starting the
answer to a new question to enforce this.
• While submitting your assignment on the Gradescope website, you will have to specify which page(s)
contain the answer to question 1, which page(s) contain the answer to question 2, etc. To do this properly,
first ensure that the answer to each question starts on a different page.
• Be careful to flush all your floats (figures, tables) corresponding to question n before starting the answer
to question n + 1; otherwise, while grading, we might miss important parts of your answers.
• Your solutions must appear in proper order in the PDF file, i.e., the solution to question n must be complete
in the PDF file (including all plots, tables, proofs, etc.) before you present the solution to question n + 1.
Problem 1 (15 marks)
(Absolute Loss Regression with Sparsity) The absolute loss regression problem with $\ell_1$ regularization is
\[
\mathbf{w}_{\mathrm{opt}} = \arg\min_{\mathbf{w}} \sum_{n=1}^{N} \left| y_n - \mathbf{w}^\top \mathbf{x}_n \right| + \lambda \|\mathbf{w}\|_1
\]
where $\|\mathbf{w}\|_1 = \sum_{d=1}^{D} |w_d|$, $|\cdot|$ is the absolute value function, and $\lambda > 0$ is the regularization hyperparameter.
Is the above objective function convex? You don't need to prove this formally; brief reasoning based on
properties of other functions that are known to be convex/non-convex is fine.
Derive the expression for the (sub)gradient vector for this model.
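For concreteness (not part of what you need to submit), here is a minimal NumPy sketch of one valid subgradient of this objective; the function name subgradient and the convention sign(0) = 0 are our own choices, not requirements of the problem:

    import numpy as np

    def subgradient(w, X, y, lam):
        # One valid subgradient of sum_n |y_n - w^T x_n| + lam * ||w||_1.
        # X: (N, D) features, y: (N,) labels, w: (D,) weights, lam > 0.
        # np.sign returns 0 at non-differentiable points, which is an
        # admissible element of the subdifferential [-1, 1] there.
        residuals = y - X @ w                  # (N,) vector of y_n - w^T x_n
        g_loss = -X.T @ np.sign(residuals)     # subgradient of the absolute loss
        g_reg = lam * np.sign(w)               # subgradient of the l1 penalty
        return g_loss + g_reg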
Derive an alternating optimization (ALT-OPT) algorithm to learn B and S, clearly writing down the expressions
for the updates of B and S. Are both subproblems (solving for B and solving for S) equally easy/difficult in
this ALT-OPT algorithm? If yes, why? If no, why not?
Note: Since B and S are matrices, if you want, please feel free to use results for matrix derivatives (the results you
will need can be found in Sec. 2.5 of the Matrix Cookbook). However, the problem can be solved even without
matrix derivative results, with some rearrangement of terms and using vector derivatives.
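The objective defining B and S is not reproduced in this excerpt, so the following is only an illustrative sketch of the alternating-optimization pattern under an assumed squared-error factorization objective ||X − BS||_F^2; the shapes, the random initialization, and both least-squares updates below are assumptions, not the problem's actual model:

    import numpy as np

    def alt_opt(X, K, n_iters=100, seed=0):
        # ALT-OPT sketch for a *hypothetical* objective ||X - B S||_F^2,
        # with X: (N, D), B: (N, K), S: (K, D). Each step fixes one matrix
        # and solves an ordinary least-squares problem for the other.
        rng = np.random.default_rng(seed)
        N, D = X.shape
        B = rng.standard_normal((N, K))
        S = rng.standard_normal((K, D))
        for _ in range(n_iters):
            S = np.linalg.lstsq(B, X, rcond=None)[0]        # fix B, solve for S
            B = np.linalg.lstsq(S.T, X.T, rcond=None)[0].T  # fix S, solve for B
        return B, S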
where X is the N × D feature matrix and y is the N × 1 vector of labels of the N training examples. Note
that the factor of 1/2 has been used in the above expression just for convenience of the derivations required for this
problem and does not change the solution to the problem.
Derive Newton's method's update equation for each iteration. For this model, how many iterations will
Newton's method take to converge?
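The objective referred to above is not reproduced in this excerpt; as a generic sketch only, a single Newton step takes the form w ← w − H(w)^{-1} ∇L(w). The helper names grad_fn and hess_fn below are placeholders for the gradient and Hessian of whatever objective the problem specifies:

    import numpy as np

    def newton_step(w, grad_fn, hess_fn):
        # One Newton update: w_new = w - H(w)^{-1} grad(w).
        # grad_fn(w) -> (D,) gradient; hess_fn(w) -> (D, D) Hessian.
        # Solving the linear system is numerically preferable to
        # explicitly forming the Hessian inverse.
        g = grad_fn(w)
        H = hess_fn(w)
        return w - np.linalg.solve(H, g)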