HW 1
HW 1
HW 1
Instructions: Your answers to the questions below, including plots and mathematical work,
should be submitted as a single PDF file. It’s preferred that you write your answers using software
that typesets mathematics (e.g. LATEX, LYX, or MathJax via iPython), though if you need to you
may scan handwritten work. You may find the minted package convenient for including source code
in your LATEX document. If you are using LYX, then the listings package tends to work better.
1 Introduction
In this homework you will first solve some probability and linear algebra questions and then you will
implement ridge regression using gradient descent and stochastic gradient descent. We’ve provided
a lot of support Python code to get you started on the right track. References below to particular
functions that you should modify are referring to the support code, which you can download from
the website. If you have time after completing the assignment, you might pursue some of the
following:
• Study up on numpy’s broadcasting to see if you can simplify and/or speed up your code.
• Think about how you could make the code more modular so that you could easily try different
loss functions and step size methods.
• Experiment with more sophisticated approaches to setting the step sizes for SGD (e.g. try
out the recommendations in “Bottou’s SGD Tricks” on the website)
• Instead of taking 1 data point at a time, as in SGD, try minibatch gradient descent, where
you use multiple points at a time to get your step direction. How does this effect convergence
speed? Are you getting computational speedup as well by using vectorized code?
2 Mathematical Fundamentals
The following questions are designed to check how prepared you are to take this class. Familiarity
with linear algebra and probability at the level of these questions is expected for the class.
1
2.1 Probability
Let (X1 , X2 , · · · , Xd ) have a d-dimensional multivariate Gaussian distribution, with mean vector
µ ∈ Rd and covariance matrix Σ ∈ Rd×d , i.e. (X1 , X2 , · · · , Xd ) ∼ N (µ, Σ). Use µi to denote the
ith element of µ and Σij to denote the element at the ith row and j th column of Σ.
1. Let x, y ∈ Rd be two independent samples drawn from N (µ, Σ). Give expression for Ex22
and Ex − y22 . Express your answer as a function of µ and Σ. x2 represents the ℓ2 -norm
of vector x.
3. (Optional) Assume W and R are two Gaussian distributed random variables. Is W + R still
Gaussian? Justify your answer.
x∗ = argminw − x2 ,
x∈Sv
x∗ = M w,
where M is a d × d matrix. Prove that such an M always exists or more precisely find an
expression for M as a function of v1 , v2 , · · · , vk . Compute the eigenvalues and one set of
eigenvectors of M corresponding to the nonzero eigenvalues.
3 Linear Regression
3.1 Feature Normalization
When feature values differ greatly, we can get much slower rates of convergence of gradient-based
algorithms. Furthermore, when we start using regularization (introduced in a later problem),
features with larger values are treated as “more important”, which is not usually what you want.
One common approach to feature normalization is perform an affine transformation (i.e. shift and
rescale) on each feature so that all feature values in the training set are in [0, 1]. Each feature gets
2
its own transformation. We then apply the same transformations to each feature on the test1 set.
It’s important that the transformation is “learned” on the training set, and then applied to the test
set. It is possible that some transformed test set values will lie outside the [0, 1] interval.
Modify function feature_normalization to normalize all the features to [0, 1]. (Can you
use numpy’s “broadcasting” here?) Note that a feature with constant value cannot be normalized
in this way. Your function should discard features that are constant in the training set.
hθ (x) = θT x,
for θ, x ∈ Rd , and we choose θ that minimizes the following “average square loss” objective function:
m
1 2
J(θ) = (hθ (xi ) − yi ) ,
m i=1
1. Let X ∈ Rm×(d+1) be the design matrix, where the i’th row of X is xi . Let y =
T
(y1 , . . . , ym ) ∈ Rm×1 be the “response”. Write the objective function J(θ) as a matrix/vec-
tor expression, without using an explicit summation sign. [Being able to write expressions as
matrix/vector expressions without summations is crucial to making implementations that are
useful in practice, since you can use numpy (or more generally, an efficient numerical linear
algebra library) to implement these matrix/vector operations orders of magnitude faster than
naively implementing with loops in Python.]
2. Write down an expression for the gradient of J (again, as a matrix/vector expression, without
using an explicit summation sign).
3. In our search for a θ that minimizes J, suppose we take a step from θ to θ + ηh, where
h ∈ Rd+1 is the “step direction” (recall, this is not necessarily a unit vector) and η ∈ (0, ∞)
is the “step size” (note that this is not the actual length of the step, which is ηh). Use the
gradient to write down an approximate expression for the change in objective function value
J(θ + ηh) − J(θ). [This approximation is called a “linear” or “first-order” approximation.]
1 Throughout this assignment we refer to the “test” set. It may be more appropriate to call this set the “validation”
set, as it will be a set of data on which we compare the performance of multiple models. Typically a test set is only
used once, to assess the performance of the model that performed best on the validation set.
3
4. Write down the expression for updating θ in the gradient descent algorithm. Let η be the
step size.
5. Modify the function compute_square_loss, to compute J(θ) for a given θ. You might
want to create a small dataset for which you can compute J(θ) by hand, and verify that your
compute_square_loss function returns the correct value.
1. Complete batch_gradient_descent.
2 Ofcourse, it is also given by the more standard definition of directional derivative, limε→0 1ε [J(θ + εh) − J(θ)].
The form given gives a better approximation to the derivative when we are using small (but not infinitesimally small)
ε.
3 Sometimes people say “batch gradient descent” or “full batch gradient descent” to mean gradient descent, defined
as we discussed in class. They do this to distinguish it from stochastic gradient descent and minibatch gradient
descent, which they probably use as their default.
4
2. Now let’s experiment with the step size. Note that if the step size is too large, gradient descent
may not converge4 . Starting with a step-size of 0.1, try various different fixed step sizes to
see which converges most quickly and/or which diverge. As a minimum, try step sizes 0.5,
0.1, .05, and .01. Plot the average square loss as a function of the number of steps for each
step size. Briefly summarize your findings.
3. (Optional) Implement backtracking line search (google it). How does it compare to the best
fixed step-size you found in terms of number of steps? In terms of time? How does the extra
time to run backtracking line search at each step compare to the time it takes to compute the
gradient? (You can also compare the operation counts.)
where λ is the regularization parameter, which controls the degree of regularization. Note that the
bias parameter is being regularized as well. We will address that below.
1. Compute the gradient of J(θ) and write down the expression for updating θ in the gradient
descent algorithm. (Matrix/vector expression – no summations please.)
2. Implement compute_regularized_square_loss_gradient.
3. Implement regularized_grad_descent.
4. For regression problems, we may prefer to leave the bias term unregularized. One approach is
to change J(θ) so that the bias is separated out from the other parameters and left unregular-
ized. Another approach that can achieve approximately the same thing is to use a very large
number B, rather than 1, for the extra bias dimension. Explain why making B large decreases
the effective regularization on the bias term, and how we can make that regularization as weak
as we like (though not zero).
5. (Optional) Develop a formal statement of the claim in the previous problem, and prove the
statement.
7. Now fix B = 1. Choosing a reasonable step-size (or using backtracking line search), find
the θλ∗ that minimizes J(θ) over a range of λ. You should plot the training average square
4 For the mathematically inclined, there is a theorem that if the objective function is convex and differentiable,
and the gradient of the objective is Lipschitz continuous with constant L > 0, then gradient descent converges
for fixed steps of size 1/L or smaller. See https://www.cs.cmu.edu/~ggordon/10725-F12/scribes/10725_
Lecture5.pdf, Theorem 5.1.
5
loss and the test average square loss (just the average square loss part, without the reg-
ularization, in each case) as a function of λ. Your goal is to find λ that gives the mini-
mum average square loss on the test set. It’s hard to predict what λ that will be, so you
should start
your search very broadly, looking over several orders of magnitude. For exam-
ple, λ ∈ 10−7 , 10−5 , 10−3 , 10−1 , 1, 10, 100 . Once you find a range that works better, keep
zooming in. You may want to have log(λ) on the x-axis rather than λ. [If you like, you may
use sklearn to help with the hyperparameter search.]
8. What θ would you select for deployment and why?
(as it does in the empirical risk), stochastic gradient descent (SGD) can be very effective. In SGD,
rather than taking −∇J(θ) as our step direction, we take −∇fi (θ) for some i chosen uniformly at
random from {1, . . . , m}. The approximation is poor, but we will show it is unbiased.
In machine learning applications, each fi (θ) would be the loss on the ith example (and of course
we’d typically write n instead of m, for the number of training points). In practical implementations
for ML, the data points are randomly shuffled, and then we sweep through the whole training
set one by one, and perform an update for each training example individually. One pass through
the data is called an epoch. Note that each epoch of SGD touches as much data as a single step
of batch gradient descent. You can use the same ordering for each epoch, though optionally you
could investigate whether reshuffling after each epoch affects the convergence speed.
1. Show that the objective function
m
1 2
J(θ) = (hθ (xi ) − yi ) + λθT θ
m i=1
m
can be written in the form J(θ) = 1
m i=1 fi (θ) by giving an expression for fi (θ) that makes
the two expressions equivalent.
2. Show that the stochastic gradient ∇fi (θ), for i chosen uniformly at random from {1, . . . , m},
is an unbiased estimator of ∇J(θ). In other words, show that E [∇fi (θ)] = ∇J(θ) m for any
θ. (Hint: It will be easier, notationally, to prove this for a general J(θ) = m 1
i=1 fi (θ),
rather than the specific case of ridge regression. You can start by writing down an expression
for E [∇fi (θ)]...)
3. Write down the update rule for θ in SGD for the ridge regression objective function.
4. Implement stochastic_grad_descent. (Note: You could potentially generalize the code
you wrote for batch gradient to handle minibatches of any size, including 1, but this is not
necessary.)
6
5. Use SGD to find θλ∗ that minimizes the ridge regression objective for the λ and B that
you selected in the previous problem. (If you could not solve the previous problem, choose
λ = 10−2 and B = 1). Try a few fixed step sizes (at least try ηt ∈ {0.05, .005}. Note that
SGD may not converge with fixed step size. Simply note your results. Next try step sizes that
decrease with the step number according to the following schedules: ηt = Ct and ηt = √Ct ,
C ≤ 1. Please include C = 0.1 in your submissions. You’re encouraged to try different values
of C (see notes below for details). For each step size rule, plot the value of the objective
function (or the log of the objective function if that is more clear) as a function of epoch
(or step number, if you prefer) for each of the approaches to step size. How do the results
compare?
Some things to note:
• In this case we are investigating the convergence rate of the optimization algorithm with
different step size schedules, thus we’re interested in the value of the objective function,
which includes the regularization term.
√
• Sometimes the initial step size (C for C/t and C/ t) is too aggressive and will get you
into a part of parameter space from which you can’t recover. Try reducing C to counter
this problem.
• As we’ll learn in an upcoming lecture, SGD convergence is much slower than GD once we
get close to the minimizer. (Remember, the SGD step directions are very noisy versions
of the GD step direction). If you look at the objective function values on a logarithmic
scale, it may look like SGD will never find objective values that are as low as GD gets.
In terminology we’ll learn in Lecture 2, GD has much smaller “optimization error” than
SGD. However, this difference in optimization error is usually dominated by other sources
of error (estimation error and approximation error). Moreover, for very large datasets,
SGD (or minibatch GD) is much faster (by wall-clock time) than GD to reach a point
that’s close [enough] to the minimizer.
• (Optional) There is another variant of SGD, sometimes called averaged SGD, in which
rather than using the last parameter value we visit, say θT , we
Tuse the average of all
parameter values we visit along the optimization path: θ = T1 t=1 θt , where T is total
number of steps taken. Try this approach5 and see how it compares.
η0
6. (Optional) Try a stepsize rule of the form ηt = 1+η 0 λt
, where λ is your regularization constant,
and η0 a constant you can choose. How do the results compare?
4 Risk Minimization
4.1 Square Loss
1. Let y be a random variable with a known distribution, and consider the square loss function
ℓ(a, y) = (a − y)2 . We want to find the action a∗ that has minimal risk. That is, we want to
2
find a∗ = arg mina E (a − y) , where the expectation is with respect to y. Show that a∗ = Ey,
and the Bayes risk (i.e. the risk of a∗ ) is Var(y). In other words, if you want to try to predict
5 Some theory for averaged SGD is given on page 191 of Understanding Machine Learning: From Theory to
Algorithms. Refer to page 195 of the same book for other averaging techniques you can try.
7
the value of a random variable, the best you can do (for minimizing expected square loss) is
to predict the mean of the distribution. Your expected loss for predicting the mean will be
2
the variance of the distribution. [Hint: Recall that Var(y) = Ey 2 − (Ey) .]
2. Now let’s introduce an input. Recall that the expected loss or “risk” of a decision function
f : X → A is
R(f ) = Eℓ(f (x), y),
where (x, y) ∼ PX ×Y , and the Bayes decision function f ∗ : X → A is a function that
achieves the minimal risk among all possible functions:
Here we consider the regression setting, in which A = Y = R. We will show for the square
2
loss ℓ(a, y) = (a − y) , the Bayes decision function is f ∗ (x) = E [y | x], where the expectation
is over y. As before, we assume we know the data-generating distribution PX ×Y .
(a) We’ll approach this problem by finding the optimal action for any given x. If somebody
tells us x, we know that the corresponding y is coming from the conditional distribution
y | x. For a particular x, what value should we predict (i.e. what action a should we
produce) that has minimal expected loss? Express your answer as a decision function
f (x), which gives the best action
for any given x. In mathematical notation, we’re
2
looking for f ∗ (x) = arg mina E (a − y) | x , where the expectation is with respect to
y. (Hint: There is really nothing to do here except write down the answer, based on the
previous question. But make sure you understand what’s happening...)
(b) In the previous problem we produced a decision function f ∗ (x) that minimized the risk
for each x. In other words, for any other decision function f (x), f ∗ (x) is going to be at
least as good as f (x), for every single x. In math, we mean
2 2
E (f ∗ (x) − y) | x ≤ E (f (x) − y) | x ,
for all x. To show that f ∗ (x) is the Bayes decision function, we need to show that
2 2
E (f ∗ (x) − y) ≤ E (f (x) − y)
for any f . Explain why this is true. (Hint: Law of iterated expectations.)