CS 229, Public Course Problem Set #1 Solutions: Supervised Learning
So

\[
\frac{\partial^2 J(\theta)}{\partial\theta_j\,\partial\theta_k}
= \sum_{i=1}^m \frac{\partial}{\partial\theta_k}\left[\left(\theta^T x^{(i)} - y^{(i)}\right) x_j^{(i)}\right]
= \sum_{i=1}^m x_j^{(i)} x_k^{(i)} = (X^T X)_{jk}.
\]
Therefore, the Hessian of J(θ) is H = X^T X. This can also be derived by simply applying
rules from the lecture notes on Linear Algebra.
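As a quick sanity check, a minimal Octave/MATLAB sketch on synthetic data (all variable names here are illustrative) confirms that a finite-difference Hessian of J(θ) matches X^T X:

% Sanity check sketch (Octave/MATLAB, synthetic data): a finite-difference
% Hessian of J(theta) should match X'*X.
m = 20; n = 3;
X = randn(m, n);  y = randn(m, 1);
J = @(theta) 0.5 * sum((X*theta - y).^2);   % least-squares cost J(theta)
theta0 = randn(n, 1);  step = 1e-3;
H_num = zeros(n, n);
for j = 1:n
  for k = 1:n
    ej = zeros(n,1); ej(j) = step;
    ek = zeros(n,1); ek(k) = step;
    % central difference approximation of d^2 J / (dtheta_j dtheta_k)
    H_num(j,k) = (J(theta0+ej+ek) - J(theta0+ej-ek) ...
                  - J(theta0-ej+ek) + J(theta0-ej-ek)) / (4*step^2);
  end
end
disp(max(max(abs(H_num - X'*X))));          % should be ~0 since J is quadratic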
(b) Show that the first iteration of Newton's method gives us θ⋆ = (X^T X)^{-1} X^T \vec{y}, the solution to our least squares problem.
Answer: Given any θ^(0), Newton's method finds θ^(1) according to

\[
\theta^{(1)} = \theta^{(0)} - H^{-1}\nabla_\theta J\big(\theta^{(0)}\big)
= \theta^{(0)} - (X^T X)^{-1}\big(X^T X\,\theta^{(0)} - X^T \vec{y}\,\big)
= \theta^{(0)} - \theta^{(0)} + (X^T X)^{-1} X^T \vec{y}
= (X^T X)^{-1} X^T \vec{y}.
\]

Therefore, no matter what θ^(0) we pick, Newton's method always finds θ⋆ after one iteration.
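This can also be seen numerically; the sketch below (Octave/MATLAB, synthetic data, illustrative names) takes one Newton step from an arbitrary starting point and lands on the normal-equations solution:

% One Newton step from an arbitrary starting point equals the least squares solution.
m = 50; n = 4;
X = randn(m, n);  y = randn(m, 1);
theta0 = randn(n, 1);                        % arbitrary initialization
grad   = X'*X*theta0 - X'*y;                 % gradient of J at theta0
H      = X'*X;                               % Hessian of J
theta1 = theta0 - H \ grad;                  % one Newton-Raphson step
theta_star = (X'*X) \ (X'*y);                % normal equations solution
disp(norm(theta1 - theta_star));             % should be (numerically) zero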
2. Locally weighted logistic regression

For this problem, the gradient of the weighted, regularized log-likelihood ℓ(θ) is

\[
\nabla_\theta \ell(\theta) = X^T z - \lambda\theta,
\]

where z ∈ R^m is defined by

\[
z_i = w^{(i)}\big(y^{(i)} - h_\theta(x^{(i)})\big),
\]

and the Hessian is given by

\[
H = X^T D X - \lambda I,
\]

where D ∈ R^{m×m} is a diagonal matrix with

\[
D_{ii} = -w^{(i)}\, h_\theta(x^{(i)})\big(1 - h_\theta(x^{(i)})\big).
\]

For the sake of this problem you can just use the above formulas, but you should try to derive these results for yourself as well.
Given a query point x, we compute the weights

\[
w^{(i)} = \exp\!\left(-\frac{\lVert x - x^{(i)}\rVert^2}{2\tau^2}\right).
\]
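In code, the gradient and Hessian above might be evaluated as in the following sketch (Octave/MATLAB with synthetic data; the variable names X, y, w, theta, and lambda are illustrative, standing for the design matrix, labels, weights, current parameters, and regularization constant):

% Sketch (synthetic data): evaluating the gradient and Hessian formulas above.
m = 20; n = 3;
X = randn(m, n);  y = double(rand(m, 1) > 0.5);   % synthetic inputs and labels
w = rand(m, 1);                                   % some positive weights w^(i)
theta = zeros(n, 1);  lambda = 1e-4;              % current parameters, reg. constant
h = 1 ./ (1 + exp(-X * theta));                   % h_theta(x^(i)) for all i
z = w .* (y - h);                                 % z_i = w^(i) (y^(i) - h_theta(x^(i)))
grad = X' * z - lambda * theta;                   % gradient of l(theta)
D = diag(-w .* h .* (1 - h));                     % D_ii = -w^(i) h_theta (1 - h_theta)
H = X' * D * X - lambda * eye(n);                 % Hessian of l(theta)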
(a) Implement the Newton-Raphson algorithm for optimizing ℓ(θ) for a new query point
x, and use this to predict the class of x.
The q2/ directory contains data and code for this problem. You should implement
the y = lwlr(X_train, y_train, x, tau) function in the lwlr.m file. This function
takes as input the training set (the X_train and y_train matrices, in the form
described in the class notes), a new query point x and the weight bandwidth tau.
Given this input the function should 1) compute weights w^(i) for each training
example, using the formula above, 2) maximize ℓ(θ) using Newton's method, and
finally 3) output y = 1{hθ(x) > 0.5} as the prediction.
We provide two additional functions that might help. The [X_train, y_train] =
load_data; function will load the matrices from files in the data/ folder. The
function plot_lwlr(X_train, y_train, tau, resolution) will plot the resulting
classifier (assuming you have properly implemented lwlr.m). This function evaluates
the locally weighted logistic regression classifier over a large grid of points and
plots the resulting prediction as blue (predicting y = 0) or red (predicting y = 1).
Depending on how fast your lwlr function is, creating the plot might take some time,
so we recommend debugging your code with resolution = 50; and later increasing it to
at least 200 to get a better idea of the decision boundary.
Answer: Our implementation of lwlr.m:
function y = lwlr(X_train, y_train, x, tau)
m = size(X_train,1);
n = size(X_train,2);
theta = zeros(n,1);
% compute weights
w = exp(-sum((X_train - repmat(x', m, 1)).^2, 2) / (2*tau^2));
% maximize l(theta) with Newton's method; lambda = 1e-4 regularization assumed
g = ones(n,1);
while (norm(g) > 1e-6)
  h = 1 ./ (1 + exp(-X_train * theta));
  g = X_train' * (w .* (y_train - h)) - 1e-4 * theta;
  H = -X_train' * diag(w .* h .* (1 - h)) * X_train - 1e-4 * eye(n);
  theta = theta - H \ g;
end
% return predicted y; h_theta(x) > 0.5 is equivalent to theta'*x > 0
y = double(x'*theta > 0);
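A usage sketch, assuming the provided load_data and plot_lwlr helpers behave as described above (the query point and bandwidth chosen here are arbitrary):

% Example usage of lwlr.m with the provided q2/ helpers.
[X_train, y_train] = load_data;            % load matrices from data/
x = X_train(1, :)';                        % an example query point (column vector)
y_pred = lwlr(X_train, y_train, x, 0.5);   % predict its class with tau = 0.5
plot_lwlr(X_train, y_train, 0.5, 50);      % coarse plot; raise resolution later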
(b) Evaluate the system with a variety of different bandwidth parameters τ. In particular,
try τ = 0.01, 0.05, 0.1, 0.5, 1.0, 5.0. How does the classification boundary change when
varying this parameter? Can you predict what the decision boundary of ordinary
(unweighted) logistic regression would look like?
Answer: These are the resulting decision boundaries for the different values of τ.
For smaller τ, the classifier appears to overfit the data set, obtaining zero training error
but outputting a sporadic-looking decision boundary. As τ grows, the resulting decision
boundary becomes smoother, eventually converging (in the limit as τ → ∞) to the
unweighted logistic regression solution.
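A sketch for reproducing the boundary plots for each bandwidth in part (b), assuming the q2/ helpers described above (plot_lwlr is assumed to draw into the current figure):

% Sweep the bandwidths from part (b).
[X_train, y_train] = load_data;
taus = [0.01, 0.05, 0.1, 0.5, 1.0, 5.0];
for i = 1:length(taus)
  figure;                                    % one window per bandwidth
  plot_lwlr(X_train, y_train, taus(i), 50);  % raise resolution to >= 200 for detail
end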
3. Multivariate least squares

Thus for each training example, y^(i) is vector-valued, with p entries. We wish to use a linear
model to predict the outputs, as in least squares, by specifying the parameter matrix Θ in

\[
y = \Theta^T x,
\]

where Θ ∈ R^{n×p}. The cost function for this case is

\[
J(\Theta) = \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^p \left( (\Theta^T x^{(i)})_j - y_j^{(i)} \right)^2.
\]
(a) Write J(Θ) in matrix-vector notation (i.e., without using any summations). [Hint:
Start with the m × n design matrix

\[
X = \begin{bmatrix} \text{---}\; (x^{(1)})^T \;\text{---} \\ \text{---}\; (x^{(2)})^T \;\text{---} \\ \vdots \\ \text{---}\; (x^{(m)})^T \;\text{---} \end{bmatrix}
\]

and the m × p target matrix

\[
Y = \begin{bmatrix} \text{---}\; (y^{(1)})^T \;\text{---} \\ \text{---}\; (y^{(2)})^T \;\text{---} \\ \vdots \\ \text{---}\; (y^{(m)})^T \;\text{---} \end{bmatrix}
\]

and then work out how to express J(Θ) in terms of these matrices.]
Answer: The objective function can be expressed as

\[
J(\Theta) = \frac{1}{2} \operatorname{tr}\!\left( (X\Theta - Y)^T (X\Theta - Y) \right).
\]

To see this, note that

\[
\begin{aligned}
J(\Theta) &= \frac{1}{2} \operatorname{tr}\!\left( (X\Theta - Y)^T (X\Theta - Y) \right) \\
&= \frac{1}{2} \sum_i \left( (X\Theta - Y)^T (X\Theta - Y) \right)_{ii} \\
&= \frac{1}{2} \sum_i \sum_j (X\Theta - Y)_{ij}^2 \\
&= \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^p \left( (\Theta^T x^{(i)})_j - y_j^{(i)} \right)^2.
\end{aligned}
\]
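As a quick numerical check, a minimal Octave/MATLAB sketch on synthetic data (names illustrative) confirms that the trace form equals the double-sum form:

% Verify tr((X*Theta - Y)'*(X*Theta - Y))/2 equals the elementwise double sum.
m = 10; n = 3; p = 2;
X = randn(m, n);  Y = randn(m, p);  Theta = randn(n, p);
J_trace = 0.5 * trace((X*Theta - Y)' * (X*Theta - Y));
J_sum   = 0.5 * sum(sum((X*Theta - Y).^2));
disp(abs(J_trace - J_sum));   % should be ~0 up to floating point error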
(b) Find the closed form solution for Θ which minimizes J(Θ). This is the equivalent of
the normal equations for the multivariate case.
Answer: First we take the gradient of J(Θ) with respect to Θ:

\[
\begin{aligned}
\nabla_\Theta J(\Theta)
&= \nabla_\Theta \left[ \frac{1}{2} \operatorname{tr}\!\left( (X\Theta - Y)^T (X\Theta - Y) \right) \right] \\
&= \nabla_\Theta \left[ \frac{1}{2} \operatorname{tr}\!\left( \Theta^T X^T X \Theta - \Theta^T X^T Y - Y^T X \Theta + Y^T Y \right) \right] \\
&= \frac{1}{2} \nabla_\Theta \left[ \operatorname{tr}(\Theta^T X^T X \Theta) - \operatorname{tr}(\Theta^T X^T Y) - \operatorname{tr}(Y^T X \Theta) + \operatorname{tr}(Y^T Y) \right] \\
&= \frac{1}{2} \nabla_\Theta \left[ \operatorname{tr}(\Theta^T X^T X \Theta) - 2\operatorname{tr}(Y^T X \Theta) + \operatorname{tr}(Y^T Y) \right] \\
&= \frac{1}{2} \left[ X^T X \Theta + X^T X \Theta - 2 X^T Y \right] \\
&= X^T X \Theta - X^T Y.
\end{aligned}
\]
Setting this expression to zero we obtain

\[
\Theta = (X^T X)^{-1} X^T Y.
\]

This looks very similar to the closed form solution in the univariate case, except now Y
is an m × p matrix, so Θ is also a matrix, of size n × p.
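As a sketch (Octave/MATLAB, synthetic data), the closed-form solution can be checked by confirming that the gradient above vanishes there:

% Check that the gradient X'*X*Theta - X'*Y is zero at the closed-form solution.
m = 30; n = 4; p = 3;
X = randn(m, n);  Y = randn(m, p);
Theta = (X'*X) \ (X'*Y);              % closed-form minimizer
G = X'*X*Theta - X'*Y;                % gradient of J at Theta
disp(norm(G, 'fro'));                 % should be ~0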
(c) Suppose instead of considering the multivariate vectors y^(i) all at once, we instead
compute each variable y_j^(i) separately for each j = 1, . . . , p. In this case, we have p
individual linear models, of the form

\[
y_j^{(i)} = \theta_j^T x^{(i)}, \qquad j = 1, \ldots, p.
\]

(So here, each θ_j ∈ R^n.) How do the parameters from these p independent least
squares problems compare to the multivariate solution?
Answer: This time, we construct a set of vectors

\[
\vec{y}_j = \begin{bmatrix} y_j^{(1)} \\ y_j^{(2)} \\ \vdots \\ y_j^{(m)} \end{bmatrix}, \qquad j = 1, \ldots, p.
\]

Then our j-th linear model can be solved by the least squares solution

\[
\theta_j = (X^T X)^{-1} X^T \vec{y}_j.
\]

If we line up our θ_j, we see that we have the following equation:

\[
[\theta_1 \;\; \theta_2 \;\; \cdots \;\; \theta_p]
= \left[ (X^T X)^{-1} X^T \vec{y}_1 \;\;\; (X^T X)^{-1} X^T \vec{y}_2 \;\;\; \cdots \;\;\; (X^T X)^{-1} X^T \vec{y}_p \right]
= (X^T X)^{-1} X^T Y = \Theta.
\]

Thus, the p independent least squares problems give exactly the columns of the multivariate solution Θ.
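This can be verified numerically with a short Octave/MATLAB sketch on synthetic data (names illustrative):

% The p column-wise least squares solutions equal the columns of Theta.
m = 30; n = 4; p = 3;
X = randn(m, n);  Y = randn(m, p);
Theta = (X'*X) \ (X'*Y);                      % multivariate solution
Theta_cols = zeros(n, p);
for j = 1:p
  Theta_cols(:, j) = (X'*X) \ (X'*Y(:, j));   % j-th independent problem
end
disp(norm(Theta - Theta_cols, 'fro'));        % should be ~0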
4. Naive Bayes
In this problem, we look at maximum likelihood parameter estimation using the naive
Bayes assumption. Here, the input features x_j, j = 1, . . . , n to our model are discrete,
binary-valued variables, so x_j ∈ {0, 1}. We let x = [x_1 x_2 · · · x_n]^T denote the input vector.
For each training example, our output target is a single binary value y ∈ {0, 1}. Our
model is then parameterized by φ_{j|y=0} = p(x_j = 1|y = 0), φ_{j|y=1} = p(x_j = 1|y = 1), and
φ_y = p(y = 1). We model the joint distribution of (x, y) according to

\[
\begin{aligned}
p(y) &= \phi_y^{\,y} (1 - \phi_y)^{1-y}, \\
p(x \mid y = 0) &= \prod_{j=1}^n \phi_{j|y=0}^{\,x_j} (1 - \phi_{j|y=0})^{1-x_j}, \\
p(x \mid y = 1) &= \prod_{j=1}^n \phi_{j|y=1}^{\,x_j} (1 - \phi_{j|y=1})^{1-x_j}.
\end{aligned}
\]

(a) Find the joint likelihood function ℓ(ϕ) = log ∏_{i=1}^m p(x^(i), y^(i); ϕ) in terms of the
model parameters given above. Here, ϕ represents the entire set of parameters
{φ_y, φ_{j|y=0}, φ_{j|y=1}, j = 1, . . . , n}.
Answer:

\[
\begin{aligned}
\ell(\varphi) &= \log \prod_{i=1}^m p(x^{(i)}, y^{(i)}; \varphi) \\
&= \log \prod_{i=1}^m p(x^{(i)} \mid y^{(i)}; \varphi)\, p(y^{(i)}; \varphi) \\
&= \log \prod_{i=1}^m \left[ \prod_{j=1}^n p(x_j^{(i)} \mid y^{(i)}; \varphi) \right] p(y^{(i)}; \varphi) \\
&= \sum_{i=1}^m \left[ \log p(y^{(i)}; \varphi) + \sum_{j=1}^n \log p(x_j^{(i)} \mid y^{(i)}; \varphi) \right] \\
&= \sum_{i=1}^m \Bigg[ y^{(i)} \log \phi_y + (1 - y^{(i)}) \log(1 - \phi_y) \\
&\qquad\qquad + \sum_{j=1}^n \left( x_j^{(i)} \log \phi_{j|y^{(i)}} + (1 - x_j^{(i)}) \log(1 - \phi_{j|y^{(i)}}) \right) \Bigg].
\end{aligned}
\]
(b) Show that the parameters which maximize the likelihood function are the same as
those given in the lecture notes; i.e., that

\[
\phi_{j|y=0} = \frac{\sum_{i=1}^m 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\}}{\sum_{i=1}^m 1\{y^{(i)} = 0\}}, \qquad
\phi_{j|y=1} = \frac{\sum_{i=1}^m 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\}}{\sum_{i=1}^m 1\{y^{(i)} = 1\}}, \qquad
\phi_y = \frac{\sum_{i=1}^m 1\{y^{(i)} = 1\}}{m}.
\]
Answer: The only terms in ℓ(ϕ) which have non-zero gradient with respect to φ_{j|y=0}
are those which include φ_{j|y^(i)}. Therefore,

\[
\begin{aligned}
\nabla_{\phi_{j|y=0}} \ell(\varphi)
&= \nabla_{\phi_{j|y=0}} \sum_{i=1}^m \left( x_j^{(i)} \log \phi_{j|y^{(i)}} + (1 - x_j^{(i)}) \log(1 - \phi_{j|y^{(i)}}) \right) \\
&= \nabla_{\phi_{j|y=0}} \sum_{i=1}^m \left( x_j^{(i)} \log(\phi_{j|y=0})\, 1\{y^{(i)} = 0\} + (1 - x_j^{(i)}) \log(1 - \phi_{j|y=0})\, 1\{y^{(i)} = 0\} \right) \\
&= \sum_{i=1}^m \left( x_j^{(i)} \frac{1}{\phi_{j|y=0}}\, 1\{y^{(i)} = 0\} - (1 - x_j^{(i)}) \frac{1}{1 - \phi_{j|y=0}}\, 1\{y^{(i)} = 0\} \right).
\end{aligned}
\]

Setting this gradient to zero and solving gives

\[
\phi_{j|y=0} = \frac{\sum_{i=1}^m 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\}}{\sum_{i=1}^m 1\{y^{(i)} = 0\}},
\]

and the derivation for φ_{j|y=1} is identical with the roles of y = 0 and y = 1 exchanged.
To solve for φ_y,

\[
\nabla_{\phi_y} \ell(\varphi)
= \nabla_{\phi_y} \sum_{i=1}^m \left( y^{(i)} \log \phi_y + (1 - y^{(i)}) \log(1 - \phi_y) \right)
= \sum_{i=1}^m \left( y^{(i)} \frac{1}{\phi_y} - (1 - y^{(i)}) \frac{1}{1 - \phi_y} \right).
\]

Setting this to zero, we obtain

\[
\phi_y = \frac{\sum_{i=1}^m 1\{y^{(i)} = 1\}}{m}.
\]
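A minimal Octave/MATLAB sketch of these estimates, computed from synthetic binary data (the variable names X and y are illustrative, with rows of X as examples):

% Maximum likelihood naive Bayes parameter estimates from binary data.
m = 100; n = 6;
X = double(rand(m, n) > 0.5);            % synthetic binary features, m x n
y = double(rand(m, 1) > 0.5);            % synthetic binary labels, m x 1
phi_y    = sum(y == 1) / m;                       % p(y = 1)
phi_j_y0 = sum(X(y == 0, :), 1) / sum(y == 0);    % p(x_j = 1 | y = 0), 1 x n
phi_j_y1 = sum(X(y == 1, :), 1) / sum(y == 1);    % p(x_j = 1 | y = 1), 1 x n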
(c) Consider making a prediction on some new data point x using the most likely class
estimate generated by the naive Bayes algorithm. Show that the hypothesis returned
by naive Bayes is a linear classifier—i.e., if p(y = 0|x) and p(y = 1|x) are the class
probabilities returned by naive Bayes, show that there exists some θ ∈ Rn+1 such
that
\[
p(y = 1 \mid x) \ge p(y = 0 \mid x) \quad\text{if and only if}\quad \theta^T \begin{bmatrix} 1 \\ x \end{bmatrix} \ge 0.
\]
(Assume θ0 is an intercept term.)
Answer:

\[
\begin{aligned}
& p(y = 1 \mid x) \ge p(y = 0 \mid x) \\
\iff\ & \frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} \ge 1 \\
\iff\ & \frac{\left( \prod_{j=1}^n p(x_j \mid y = 1) \right) p(y = 1)}{\left( \prod_{j=1}^n p(x_j \mid y = 0) \right) p(y = 0)} \ge 1 \\
\iff\ & \frac{\left( \prod_{j=1}^n \phi_{j|y=1}^{\,x_j} (1 - \phi_{j|y=1})^{1 - x_j} \right) \phi_y}{\left( \prod_{j=1}^n \phi_{j|y=0}^{\,x_j} (1 - \phi_{j|y=0})^{1 - x_j} \right) (1 - \phi_y)} \ge 1 \\
\iff\ & \sum_{j=1}^n \left( x_j \log \frac{\phi_{j|y=1}}{\phi_{j|y=0}} + (1 - x_j) \log \frac{1 - \phi_{j|y=1}}{1 - \phi_{j|y=0}} \right) + \log \frac{\phi_y}{1 - \phi_y} \ge 0 \\
\iff\ & \sum_{j=1}^n x_j \log \frac{\phi_{j|y=1}(1 - \phi_{j|y=0})}{\phi_{j|y=0}(1 - \phi_{j|y=1})} + \sum_{j=1}^n \log \frac{1 - \phi_{j|y=1}}{1 - \phi_{j|y=0}} + \log \frac{\phi_y}{1 - \phi_y} \ge 0 \\
\iff\ & \theta^T \begin{bmatrix} 1 \\ x \end{bmatrix} \ge 0,
\end{aligned}
\]
where

\[
\theta_0 = \sum_{j=1}^n \log \frac{1 - \phi_{j|y=1}}{1 - \phi_{j|y=0}} + \log \frac{\phi_y}{1 - \phi_y}, \qquad
\theta_j = \log \frac{\phi_{j|y=1}(1 - \phi_{j|y=0})}{\phi_{j|y=0}(1 - \phi_{j|y=1})}, \quad j = 1, \ldots, n.
\]
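A short Octave/MATLAB sketch (synthetic parameters, illustrative names) checks that the linear decision rule with this θ agrees with the direct posterior comparison:

% Build theta from the naive Bayes parameters and compare decision rules.
n = 5;
phi_y    = 0.3;
phi_j_y0 = rand(n, 1);  phi_j_y1 = rand(n, 1);      % synthetic parameters
theta0 = sum(log((1 - phi_j_y1) ./ (1 - phi_j_y0))) + log(phi_y / (1 - phi_y));
theta  = [theta0; log((phi_j_y1 .* (1 - phi_j_y0)) ./ (phi_j_y0 .* (1 - phi_j_y1)))];
x = double(rand(n, 1) > 0.5);                        % random binary input
p1 = prod(phi_j_y1.^x .* (1 - phi_j_y1).^(1 - x)) * phi_y;        % prop. to p(y=1|x)
p0 = prod(phi_j_y0.^x .* (1 - phi_j_y0).^(1 - x)) * (1 - phi_y);  % prop. to p(y=0|x)
disp([p1 >= p0, theta' * [1; x] >= 0]);              % the two flags should agree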
5. Exponential family and the geometric distribution

(a) Consider the geometric distribution parameterized by φ:

\[
p(y; \phi) = (1 - \phi)^{y-1} \phi, \qquad y = 1, 2, 3, \ldots .
\]

Show that the geometric distribution is in the exponential family, and give b(y), η,
T(y), and a(η).
Answer:

\[
\begin{aligned}
p(y; \phi) &= (1 - \phi)^{y-1} \phi \\
&= \exp\left[ \log (1 - \phi)^{y-1} + \log \phi \right] \\
&= \exp\left[ (y - 1) \log(1 - \phi) + \log \phi \right] \\
&= \exp\left[ y \log(1 - \phi) - \log \frac{1 - \phi}{\phi} \right].
\end{aligned}
\]

Then

\[
b(y) = 1, \qquad
\eta = \log(1 - \phi), \qquad
T(y) = y, \qquad
a(\eta) = \log \frac{1 - \phi}{\phi} = \log \frac{e^\eta}{1 - e^\eta}.
\]
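A quick numerical check in Octave/MATLAB (an illustrative sketch with an arbitrary φ) that the exponential-family form reproduces the geometric pmf:

% Verify p(y; phi) = b(y) exp(eta*T(y) - a(eta)) for the geometric distribution.
phi = 0.3;
eta = log(1 - phi);
a   = log(exp(eta) / (1 - exp(eta)));    % a(eta) = log((1-phi)/phi)
y   = 1:10;
p_direct = (1 - phi).^(y - 1) * phi;     % (1-phi)^(y-1) * phi
p_expfam = exp(eta .* y - a);            % b(y) exp(eta*T(y) - a(eta)), b(y) = 1
disp(max(abs(p_direct - p_expfam)));     % should be ~0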
(c) For a training set {(x(i) , y (i) ); i = 1, . . . , m}, let the log-likelihood of an example
be log p(y (i) |x(i) ; θ). By taking the derivative of the log-likelihood with respect to
θj , derive the stochastic gradient ascent rule for learning using a GLM model with
geometric responses y and the canonical response function.
Answer: The log-likelihood of an example (x^(i), y^(i)) is defined as ℓ_i(θ) = log p(y^(i)|x^(i); θ).
To derive the stochastic gradient ascent rule, use the results from previous parts and the
standard GLM assumption that η = θ^T x.
" T (i)
!!#
T (i) (i) eθ x
ℓi (θ) = log exp θ x ·y − log
1 − eθT x(i)
T (i) (i) 1
= log exp θ x · y − log
e−θT x(i) − 1
T (i)
= θT x(i) · y (i) + log e−θ x − 1
T
x(i)
∂ (i) e−θ (i)
ℓi (θ) = xj y (i) + (−xj )
∂θj e−θT x(i) − 1
(i) 1 (i)
= xj y (i) − x
1 − e T x(i) j
−θ
(i) 1 (i)
= y − xj .
1 − eθT x(i)
Thus the stochastic gradient ascent update rule should be

\[
\theta_j := \theta_j + \alpha \frac{\partial \ell_i(\theta)}{\partial \theta_j},
\]

which is

\[
\theta_j := \theta_j + \alpha \left( y^{(i)} - \frac{1}{1 - e^{\theta^T x^{(i)}}} \right) x_j^{(i)}.
\]
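A minimal Octave/MATLAB sketch of one such update, written as a small function (names and the learning rate are illustrative; the file could be saved as sga_step_geometric.m):

% One stochastic gradient ascent step for the geometric-response GLM above.
% theta is n x 1, x_i is n x 1 (one example), y_i its geometric response,
% alpha the learning rate; the model assumes theta'*x_i < 0 so that
% phi = 1 - exp(theta'*x_i) lies in (0, 1).
function theta = sga_step_geometric(theta, x_i, y_i, alpha)
  eta = theta' * x_i;                 % natural parameter eta = theta^T x
  mu  = 1 / (1 - exp(eta));           % canonical response: E[y|x] = 1/(1 - e^eta)
  theta = theta + alpha * (y_i - mu) * x_i;
end

Looping this update over randomly selected training examples gives the full stochastic gradient ascent procedure.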