CS229 Problem Set #4
Notes: (1) These questions require thought, but do not require long answers. Please be as
concise as possible. (2) If you have a question about this homework, we encourage you to post
your question on our Piazza forum, at http://piazza.com/stanford/fall2018/cs229. (3) If
you missed the first lecture or are unfamiliar with the collaboration or honor code policy, please
read the policy on Handout #1 (available from the course website) before starting work. (4)
For the coding problems, you may not use any libraries except those defined in the provided
environment.yml file. In particular, ML-specific libraries such as scikit-learn are not permitted.
(5) To account for late days, the due date listed on Gradescope is Dec 08 at 11:59 pm. If you
submit after Dec 05, you will begin consuming your late days. If you wish to submit on time,
submit before Dec 05 at 11:59 pm.
All students must submit an electronic PDF version of the written questions. We highly recom-
mend typesetting your solutions via LaTeX. If you are scanning your document by cell phone,
please check the Piazza forum for recommended scanning apps and best practices. All students
must also submit a zip file of their source code to Gradescope, which should be created using the
make_zip.py script. In order to pass the auto-grader tests, you should make sure to (1) restrict
yourself to only using libraries included in the environment.yml file, and (2) make sure your
code runs without errors when running p01_nn.py, p04_ica.py and p06_cartpole.py. Your
submission will be evaluated by the auto-grader using a private test set.
The data for this problem can be found in the data folder as images_train.csv, images_test.csv,
labels_train.csv and labels_test.csv.
The code for this assignment can be found within p01_nn.py within the src folder.
The starter code splits the set of 60,000 training images and labels into a training set of 59,600
examples and a dev set of 400 examples.
To start, you will implement a simple convolutional neural network and cross entropy loss, and
train it with the provided data set.
The architecture is as follows:
(a) The first layer is a convolutional layer with 2 output channels with a convolution size of 4
by 4.
(b) The second layer is a max pooling layer of stride and width 5 by 5.
(c) The third layer is a ReLU activation layer.
(d) In the fourth layer, the data is flattened into a single dimension.
(e) The fifth layer is a single linear layer with output size 10 (the number of classes).
(f) The sixth layer is a softmax layer that computes the probabilities for each class.
(g) Finally, we use a cross entropy loss as our loss function.
We have provided all of the forward functions for these different layers so there is an unambiguous
definition of them in the code. Your job in this assignment will be to implement functions that
compute the gradients for these layers. However, here is some additional text that might be
helpful in understanding the forward functions.
We have discussed convolutional layers on the exam, but as a review, the following equation
defines what we mean by a 2d convolution:

output[out_channel, x, y] = bias[out_channel] + Σ_{in_channel, dx, dy} input[in_channel, x + dx, y + dy] · weight[out_channel, in_channel, dx, dy]

The max pooling layer takes the maximum of its input over each pooling window:

output[channel, x, y] = max_{dx, dy} input[channel, x · pool_width + dx, y · pool_height + dy]
The ReLU (rectified linear unit) is our activation function. The ReLU is simply max(0, x) where
x is the input.
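As an unofficial illustration of these two forward definitions (the starter code already provides the real forward functions), here is a minimal NumPy sketch. The channels-first array layout, the absence of padding, and the stride-1 convolution are assumptions made for this sketch and need not match p01_nn.py exactly.

    import numpy as np

    def conv2d_forward(x, weight, bias):
        """Naive 2d convolution (no padding, stride 1).
        x: (in_channels, H, W), weight: (out_channels, in_channels, kH, kW), bias: (out_channels,)."""
        out_c, in_c, kh, kw = weight.shape
        H, W = x.shape[1], x.shape[2]
        out = np.zeros((out_c, H - kh + 1, W - kw + 1))
        for oc in range(out_c):
            for i in range(H - kh + 1):
                for j in range(W - kw + 1):
                    out[oc, i, j] = bias[oc] + np.sum(x[:, i:i+kh, j:j+kw] * weight[oc])
        return out

    def max_pool_forward(x, pool_h, pool_w):
        """Non-overlapping max pooling with stride equal to the pool size.
        x: (channels, H, W) with H and W divisible by the pool size."""
        c, H, W = x.shape
        out = np.zeros((c, H // pool_h, W // pool_w))
        for i in range(H // pool_h):
            for j in range(W // pool_w):
                window = x[:, i*pool_h:(i+1)*pool_h, j*pool_w:(j+1)*pool_w]
                out[:, i, j] = window.max(axis=(1, 2))
        return out

Under those assumptions, a 28×28 MNIST input gives 25×25 outputs per channel after the 4×4 convolution, 5×5 after the 5×5 pooling, and 2 · 5 · 5 = 50 features after flattening, which feed the linear layer of output size 10.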
We use cross entropy loss as our loss function. Recall that for a single example (x, y), the cross
entropy loss is:
CE(y, ŷ) = − Σ_{k=1}^{K} y_k log ŷ_k ,
where ŷ ∈ RK is the vector of softmax outputs from the model for the training example x, and
y ∈ RK is the ground-truth vector for the training example x such that y = [0, ..., 0, 1, 0, ..., 0]>
contains a single 1 at the position of the correct class (also called a “one-hot” representation).
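For concreteness, here is a minimal NumPy sketch of the softmax and cross entropy forward computations defined above; the function names are illustrative and are not the ones the auto-grader expects.

    import numpy as np

    def softmax(logits):
        """Numerically stable softmax over a vector of class scores."""
        shifted = logits - np.max(logits)      # subtract max for numerical stability
        exp = np.exp(shifted)
        return exp / np.sum(exp)

    def cross_entropy_loss(y_one_hot, y_hat):
        """CE(y, y_hat) = -sum_k y_k log y_hat_k for a single example."""
        return -np.sum(y_one_hot * np.log(y_hat))

    # Example: 3 classes, correct class is index 1.
    probs = softmax(np.array([2.0, 1.0, 0.1]))
    loss = cross_entropy_loss(np.array([0.0, 1.0, 0.0]), probs)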
We are also doing mini-batch gradient descent with a batch size of 16. Normally we would
iterate over the data multiple times with multiple epochs, but for this assignment we only do
400 batches to save time.
(a) [20 points]
Implement the following functions within p01_nn.py. We recommend that you start at the
top of the list and work your way down:
i. backward_softmax
ii. backward_relu
iii. backward_cross_entropy_loss
iv. backward_linear
v. backward_convolution
vi. backward_max_pool
(b) [10 points] Now implement a function that computes the full backward pass.
i. backward_prop
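A convenient (optional) way to sanity-check any of the backward functions above is a finite-difference gradient check. The sketch below is generic, assumes nothing about the actual signatures in p01_nn.py, and simply compares the analytic gradient of a scalar-valued function against a numerical estimate.

    import numpy as np

    def numerical_gradient(f, x, eps=1e-6):
        """Central-difference estimate of the gradient of a scalar-valued f at x."""
        grad = np.zeros_like(x)
        for idx in np.ndindex(*x.shape):
            orig = x[idx]
            x[idx] = orig + eps
            f_plus = f(x)
            x[idx] = orig - eps
            f_minus = f(x)
            x[idx] = orig                  # restore the original entry
            grad[idx] = (f_plus - f_minus) / (2 * eps)
        return grad

    # Example: f(x) = sum(x**2) has gradient 2x; compare analytic vs. numerical.
    x = np.random.randn(4, 3)
    analytic = 2 * x
    numeric = numerical_gradient(lambda v: np.sum(v ** 2), x)
    assert np.allclose(analytic, numeric, atol=1e-4)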
E_{s∼p(s), a∼π1(s,a)} R(s, a)

E_{s∼p(s), a∼π1(s,a)} R(s, a) = Σ_{(s,a)} R(s, a) p(s, a)
                              = Σ_{(s,a)} R(s, a) p(s) p(a|s)
                              = Σ_{(s,a)} R(s, a) p(s) π1(s, a)
Unfortunately, we cannot estimate this directly as we only have samples created under policy π0
and not π1 . For this problem, we will be looking at formulas that approximate this value using
expectations under π0 that we can actually estimate.
We will make one additional assumption that each action has a non-zero probability in the
observed policy π0 (s, a). In other words, for all actions a and states s, π0 (s, a) > 0.
Regression: The simplest possible estimator is to directly use our learned MDP parameters to
estimate our goal. This is usually called the regression estimator. While training our MDP, we
learn an estimator R̂(s, a) that estimates R(s, a). We can now directly estimate
E_{s∼p(s), a∼π1(s,a)} R(s, a)

with

E_{s∼p(s), a∼π1(s,a)} R̂(s, a).
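To make the regression estimator concrete, here is a self-contained NumPy sketch that computes both the true target E_{s∼p(s), a∼π1(s,a)} R(s, a) by the enumeration shown earlier and the regression estimate obtained by substituting R̂. Every table below is a made-up toy value used purely for illustration.

    import numpy as np

    # Toy problem: 2 states, 2 actions (invented numbers for illustration).
    p_s = np.array([0.6, 0.4])                    # p(s)
    pi1 = np.array([[0.9, 0.1],                   # pi_1(s, a): new policy
                    [0.2, 0.8]])
    R = np.array([[1.0, 0.0],                     # true reward R(s, a)
                  [0.3, 2.0]])
    R_hat = np.array([[0.9, 0.1],                 # learned reward model R_hat(s, a)
                      [0.4, 1.8]])

    # True value of pi_1: sum over (s, a) of R(s, a) p(s) pi_1(s, a).
    true_value = np.sum(R * p_s[:, None] * pi1)

    # Regression estimator: the same sum with R_hat in place of R.
    regression_estimate = np.sum(R_hat * p_s[:, None] * pi1)

    print(true_value, regression_estimate)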
(a) [2 points] Importance Sampling: One commonly used estimator is known as the impor-
tance sampling estimator. Let π̂0 be an estimate of the true π0 . The importance sampling
estimator uses that π̂0 and has the form:
E_{s∼p(s), a∼π0(s,a)} [ (π1(s, a) / π̂0(s, a)) R(s, a) ]

Please show that if π̂0 = π0, then the importance sampling estimator is equal to:

E_{s∼p(s), a∼π1(s,a)} R(s, a)
Note that this estimator only requires us to model π0 as we have the R(s, a) values for the
items in the observational data.
(b) [2 points] Weighted Importance Sampling: One variant of the importance sampling es-
timator is known as the weighted importance sampling estimator. The weighted importance
sampling estimator has the form:
( E_{s∼p(s), a∼π0(s,a)} [ (π1(s, a) / π̂0(s, a)) R(s, a) ] ) / ( E_{s∼p(s), a∼π0(s,a)} [ π1(s, a) / π̂0(s, a) ] )

Please show that if π̂0 = π0, then the weighted importance sampling estimator is equal to:

E_{s∼p(s), a∼π1(s,a)} R(s, a)
(c) [2 points] One issue with the weighted importance sampling estimator is that it can be
biased in many finite sample situations. In finite samples, we replace the expected value
with a sum over the seen values in our observational dataset. Please show that the weighted
importance sampling estimator is biased in these situations.
Hint: Consider the case where there is only a single data element in your observational
dataset.
(d) [7 points] Doubly Robust: One final commonly used estimator is the doubly robust
estimator. The doubly robust estimator has the form:
E_{s∼p(s), a∼π0(s,a)} [ (E_{a∼π1(s,a)} R̂(s, a)) + (π1(s, a) / π̂0(s, a)) (R(s, a) − R̂(s, a)) ]
One advantage of the doubly robust estimator is that it works if either π̂0 = π0 or R̂(s, a) = R(s, a).
i. [4 points] Please show that the doubly robust estimator is equal to E_{s∼p(s), a∼π1(s,a)} R(s, a) when π̂0 = π0.
ii. [3 points] Please show that the doubly robust estimator is equal to E_{s∼p(s), a∼π1(s,a)} R(s, a) when R̂(s, a) = R(s, a).
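To make the formulas in parts (a), (b), and (d) concrete, the following self-contained NumPy sketch computes the empirical importance sampling, weighted importance sampling, and doubly robust estimates from logged (s, a, R) samples. The toy tables, sample size, and variable names are all invented for illustration, π̂0 is taken to equal π0, and part (b) uses the self-normalized form given above.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy problem: 2 states, 2 actions (invented values for illustration).
    p_s = np.array([0.6, 0.4])                        # p(s)
    pi0 = np.array([[0.5, 0.5], [0.5, 0.5]])          # observed (logging) policy pi_0(s, a)
    pi1 = np.array([[0.9, 0.1], [0.2, 0.8]])          # target policy pi_1(s, a)
    R = np.array([[1.0, 0.0], [0.3, 2.0]])            # true reward R(s, a)
    R_hat = np.array([[0.9, 0.1], [0.4, 1.8]])        # learned reward model R_hat(s, a)

    # Log n samples under pi_0; here we pretend pi_0_hat equals pi_0 exactly.
    n = 100000
    s = rng.choice(2, size=n, p=p_s)
    a = np.array([rng.choice(2, p=pi0[si]) for si in s])
    r = R[s, a]
    w = pi1[s, a] / pi0[s, a]                          # importance weights pi_1 / pi_0_hat

    is_estimate  = np.mean(w * r)                                  # part (a)
    wis_estimate = np.sum(w * r) / np.sum(w)                       # part (b), self-normalized
    dr_estimate  = np.mean(np.sum(pi1[s] * R_hat[s], axis=1)       # part (d): E_{a~pi_1} R_hat(s, a)
                           + w * (r - R_hat[s, a]))

    true_value = np.sum(R * p_s[:, None] * pi1)
    print(true_value, is_estimate, wis_estimate, dr_estimate)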
(e) [2 points] We will now consider several situations where you might have a choice between
the importance sampling estimator and the regression estimator. Please state whether the
importance sampling estimator or the regression estimator would probably work best in
each situation and explain why it would work better. In all of these situations, your states
s consist of patients, your actions a represent the drugs to give to certain patients and your
R(s, a) is the lifespan of the patient after receiving the drug.
i. [1 point] Drugs are randomly assigned to patients, but the interaction between the
drug, patient, and lifespan is very complicated.
ii. [1 point] Drugs are assigned to patients in a very complicated manner, but the
interaction between the drug, patient, and lifespan is very simple.
Show that the unit-length vector u that minimizes the mean squared error between projected
points and original points corresponds to the first principal component for the data. I.e., show
that
arg min_{u : uᵀu = 1} Σ_{i=1}^{m} ‖x^(i) − f_u(x^(i))‖_2^2 .
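The claim can also be checked numerically (this is not a proof). The sketch below assumes f_u(x) = (uᵀx)u, i.e., the projection of x onto the span of u as defined earlier in the problem, and compares a brute-force minimizer of the reconstruction error in 2D against the top eigenvector of the empirical covariance.

    import numpy as np

    rng = np.random.default_rng(0)

    # Zero-mean 2D data with one elongated direction.
    X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.2], [1.2, 1.0]], size=500)

    def reconstruction_error(u, X):
        """Sum of squared distances between x^(i) and its projection f_u(x^(i)) = (u^T x^(i)) u."""
        proj = (X @ u)[:, None] * u
        return np.sum((X - proj) ** 2)

    # Brute-force search over unit vectors u = (cos t, sin t).
    thetas = np.linspace(0, np.pi, 2000)
    errors = [reconstruction_error(np.array([np.cos(t), np.sin(t)]), X) for t in thetas]
    t_best = thetas[np.argmin(errors)]
    u_best = np.array([np.cos(t_best), np.sin(t_best)])

    # First principal component: top eigenvector of the empirical covariance X^T X / m.
    eigvals, eigvecs = np.linalg.eigh(X.T @ X / X.shape[0])
    u_pca = eigvecs[:, np.argmax(eigvals)]

    # The two directions should agree up to sign: |u_best . u_pca| should be close to 1.
    print(u_best, u_pca, abs(u_best @ u_pca))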
where g is the cumulative distribution function, and g′ is the probability density function of
the source distribution (in this sub-question it is a standard normal distribution). Whereas
in the notes we derive an update rule to train W iteratively, for the case of Gaussian
distributed sources, we can analytically reason about the resulting W.
Try to derive a closed form expression for W in terms of X when g is the standard normal
CDF. Deduce the relation between W and X in the simplest terms, and highlight the
ambiguity (in terms of rotational invariance) in computing W .
(b) [10 points] Laplace source.
For this sub-question, we assume sources are distributed according to a standard Laplace
distribution, i.e. si ∼ L(0, 1). The Laplace distribution L(0, 1) has PDF fL(s) = (1/2) exp(−|s|).
With this assumption, derive the update rule for a single example in the form
W := W + α (. . .) .
(c) [5 points] Cocktail Party Problem
For this question you will implement the Bell and Sejnowski ICA algorithm, but assuming
a Laplace source (as derived in part (b)), instead of the logistic distribution covered in class.
The file mix.dat contains the input data which consists of a matrix with 5 columns, with
each column corresponding to one of the mixed signals xi . The code for this question can
be found in p04_ica.py.
Implement the update_W and unmix functions in p04_ica.py.
You can then run p04_ica.py in order to split the mixed audio into its components. The
mixed audio tracks are written to mixed_i.wav in the output folder. The split audio tracks
are written to split_i.wav in the output folder.
To make sure your code is correct, you should listen to the resulting unmixed sources.
(Some overlap or noise in the sources may be present, but the different sources should be
pretty clearly separated.)
If your implementation is correct, your output split_0.wav should sound similar to the file
correct_split_0.wav included with the source code.
Note: In our implementation, we anneal the learning rate α (slowly decrease it over
time) to speed up learning. In addition to using the variable learning rate to speed up
convergence, we also choose a random permutation of the training data and run stochastic
gradient ascent visiting the training data in that order (each of the specified learning rates
is then used for one full pass through the data).
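The note above describes the structure of the training loop rather than the update itself; a minimal sketch of that structure follows. The learning-rate schedule here is invented for illustration, and update_W stands in for the update rule you derive in part (b) and implement in p04_ica.py.

    import numpy as np

    def train_ica(X, update_W, anneal=(0.1, 0.05, 0.02, 0.01, 0.005)):
        """Stochastic gradient ascent for ICA with an annealed learning rate.

        X: (num_samples, num_signals) mixed data.
        update_W: function (W, x, alpha) -> new W implementing your update rule.
        anneal: decreasing learning rates; each one is used for one full pass over the data.
        """
        rng = np.random.default_rng(0)
        n, d = X.shape
        W = np.eye(d)
        for alpha in anneal:                   # one pass per learning rate
            order = rng.permutation(n)         # random permutation of the training data
            for i in order:
                W = update_W(W, X[i], alpha)
        return W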
(a) [10 points] Prove that, for any two finite-valued vectors V1, V2, it holds true that

‖B(V1) − B(V2)‖∞ ≤ γ ‖V1 − V2‖∞,

where

‖V‖∞ = max_{s∈S} |V(s)|.

(This shows that the Bellman update operator is a “γ-contraction in the max-norm.”)
(b) [5 points] We say that V is a fixed point of B if B(V ) = V . Using the fact that the
Bellman update operator is a γ-contraction in the max-norm, prove that B has at most
one fixed point—i.e., that there is at most one solution to the Bellman equations. You may
assume that B has at least one fixed point.
Remark: The result you proved in part (a) implies that value iteration converges geometrically
to the optimal value function V∗. That is, after k iterations, the distance between V and V∗ is
at most γ^k times its initial value.
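Although the problem asks for a proof, the contraction is easy to observe numerically. The sketch below assumes the usual Bellman backup B(V)(s) = R(s) + γ max_a Σ_{s'} P_sa(s') V(s') (the operator is defined precisely earlier in the problem, not reproduced here) and checks the inequality on a random MDP.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, gamma = 10, 3, 0.9

    # Random MDP: rewards R(s) and transition probabilities P[a, s, s'].
    R = rng.normal(size=n_states)
    P = rng.random(size=(n_actions, n_states, n_states))
    P /= P.sum(axis=2, keepdims=True)          # normalize each row to a distribution

    def bellman_backup(V):
        """B(V)(s) = R(s) + gamma * max_a sum_{s'} P_sa(s') V(s')."""
        return R + gamma * np.max(P @ V, axis=0)

    def max_norm(V):
        return np.max(np.abs(V))

    # Check the gamma-contraction property on random pairs of vectors.
    for _ in range(5):
        V1, V2 = rng.normal(size=n_states), rng.normal(size=n_states)
        lhs = max_norm(bellman_backup(V1) - bellman_backup(V2))
        rhs = gamma * max_norm(V1 - V2)
        assert lhs <= rhs + 1e-12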
[Figure: the cart-pole (inverted pendulum) system; the cart moves along a track with position x between −3 and 3.]
We have written a simple simulator for this problem. The simulation proceeds in discrete time
cycles (steps). The state of the cart and pole at any time is completely characterized by 4
parameters: the cart position x, the cart velocity ẋ, the angle of the pole θ measured as its
deviation from the vertical position, and the angular velocity of the pole θ̇. Since it would
be simpler to consider reinforcement learning in a discrete state space, we have approximated
the state space by a discretization that maps a state vector (x, ẋ, θ, θ̇) into a number from 0 to
NUM_STATES-1. Your learning algorithm will need to deal only with this discretized representation
of the states.
At every time step, the controller must choose one of two actions - push (accelerate) the cart
right, or push the cart left. (To keep the problem simple, there is no do-nothing action.) These
are represented as actions 0 and 1 respectively in the code. When the action choice is made, the
simulator updates the state parameters according to the underlying dynamics, and provides a
new discretized state.
We will assume that the reward R(s) is a function of the current state only. When the pole
angle goes beyond a certain limit or when the cart goes too far out, a negative reward is given,
and the system is reinitialized randomly. At all other times, the reward is zero. Your program
must learn to balance the pole using only the state transitions and rewards observed.
The files for this problem are in the src directory. Most of the code has already been written
for you, and you need to make changes only to p06_cartpole.py in the places specified. This
file can be run to show a display and to plot a learning curve at the end. Read the comments at
the top of the file for more details on the working of the simulation.
2 The dynamics are adapted from http://www-anw.cs.umass.edu/rlr/domains.html
To solve the inverted pendulum problem, you will estimate a model (i.e., transition probabilities
and rewards) for the underlying MDP, solve Bellman’s equations for this estimated MDP to
obtain a value function, and act greedily with respect to this value function.
Briefly, you will maintain a current model of the MDP and a current estimate of the value func-
tion. Initially, each state has estimated reward zero, and the estimated transition probabilities
are uniform (equally likely to end up in any other state).
During the simulation, you must choose actions at each time step according to some current
policy. As the program goes along taking actions, it will gather observations on transitions and
rewards, which it can use to get a better estimate of the MDP model. Since it is inefficient to
update the whole estimated MDP after every observation, we will store the state transitions and
reward observations each time, and update the model and value function/policy only periodically.
Thus, you must maintain counts of the total number of times the transition from state si to state
sj using action a has been observed (similarly for the rewards). Note that the rewards at any
state are deterministic, but the state transitions are not because of the discretization of the state
space (several different but close configurations may map onto the same discretized state).
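As a rough sketch of this bookkeeping (the array names, the sizes, and the choice to attribute each observed reward to the new state are assumptions that need not match p06_cartpole.py), you might maintain something like the following and average the counts when re-estimating the model:

    import numpy as np

    NUM_STATES, NUM_ACTIONS = 163, 2     # illustrative sizes only

    # Counts of observed transitions and accumulated rewards.
    transition_counts = np.zeros((NUM_STATES, NUM_ACTIONS, NUM_STATES))
    reward_sums = np.zeros(NUM_STATES)
    reward_counts = np.zeros(NUM_STATES)

    def record(state, action, new_state, reward):
        """Record one observed (s, a, s', r) transition.
        Here the reward is attributed to new_state; adjust to match how the simulator reports it."""
        transition_counts[state, action, new_state] += 1
        reward_sums[new_state] += reward
        reward_counts[new_state] += 1

    def estimate_model():
        """Average the observations; fall back to uniform transitions / zero reward where nothing was seen."""
        totals = transition_counts.sum(axis=2, keepdims=True)
        P_hat = np.where(totals > 0, transition_counts / np.maximum(totals, 1), 1.0 / NUM_STATES)
        R_hat = np.where(reward_counts > 0, reward_sums / np.maximum(reward_counts, 1), 0.0)
        return P_hat, R_hat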
Each time a failure occurs (such as if the pole falls over), you should re-estimate the transition
probabilities and rewards as the average of the observed values (if any). Your program must then
use value iteration to solve Bellman’s equations on the estimated MDP, to get the value function
and new optimal policy for the new model. For value iteration, use a convergence criterion
that checks whether the maximum absolute change in the value function on an iteration has
fallen below some specified tolerance.
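A minimal, self-contained sketch of value iteration with that stopping rule is below; the argument names and shapes are hypothetical and not the exact structure the starter code uses. Returning the number of sweeps is handy for the convergence check described next.

    import numpy as np

    def value_iteration(P_hat, R_hat, gamma=0.995, tol=0.01, max_iters=1000):
        """Solve Bellman's equations for the estimated MDP.

        P_hat: (num_states, num_actions, num_states) estimated transition probabilities.
        R_hat: (num_states,) estimated rewards.
        Stops when the maximum absolute change in V over one sweep falls below tol.
        """
        num_states = R_hat.shape[0]
        V = np.zeros(num_states)
        for it in range(max_iters):
            # Q[s, a] = R(s) + gamma * sum_{s'} P_hat[s, a, s'] * V[s']
            Q = R_hat[:, None] + gamma * P_hat @ V
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, it + 1       # also report how many sweeps were needed
            V = V_new
        return V, max_iters

    # The greedy policy is then pi(s) = argmax_a [R(s) + gamma * sum_{s'} P_hat[s, a, s'] V[s']].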
Finally, assume that the whole learning procedure has converged once several consecutive at-
tempts (defined by the parameter NO_LEARNING_THRESHOLD) to solve Bellman's equations all con-
verge in the first iteration. Intuitively, this indicates that the estimated model has stopped
changing significantly.
The code outline for this problem is already in p06_cartpole.py, and you need to write code
fragments only at the places specified in the file. There are several details (convergence criteria
etc.) that are also explained inside the code. Use a discount factor of γ = 0.995.
Implement the reinforcement learning algorithm as specified, and run it.
• How many trials (how many times did the pole fall over or the cart fall off) did it take before
the algorithm converged? Hint: if your solution is correct, on the plot the red line indicating
smoothed log num steps to failure should start to flatten out at about 60 iterations.
• Plot a learning curve showing the number of time-steps for which the pole was balanced
on each trial. The Python starter code already includes the code to plot this curve. Include
the plot in your submission.
• Find the line of code that says np.random.seed, and rerun the code with the seed set to 1,
2, and 3. What do you observe? What does this imply about the algorithm?