Midterm Aut2014 (Final) Sol
Name of Student:
SUNetID: @stanford.edu
The Stanford University Honor Code:
I attest that I have not given or received aid in this examination,
and that I have done my share and taken an active part in seeing
to it that others as well as myself uphold the spirit and letter of the
Honor Code.
Signed:
$$J(\theta) = \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = (X\theta - \vec{y})^T (X\theta - \vec{y})$$
The goal of least squares regression is to find θ such that we minimize J(θ) given the
training data.
Let's say that we had an original set of $n$ features, so that the training inputs were represented by the design matrix $X \in \mathbb{R}^{m \times (n+1)}$. However, we now gain access to one additional feature for every example. As a result, we now have an additional vector of features $\vec{v} \in \mathbb{R}^{m \times 1}$ for our training set that we wish to include in our regression. We can do this by creating a new design matrix: $\tilde{X} = [X \;\; \vec{v}\,] \in \mathbb{R}^{m \times (n+2)}$.
Therefore the new parameter vector is $\theta_{\text{new}} = \begin{bmatrix} \theta \\ p \end{bmatrix}$, where $p \in \mathbb{R}$ is the parameter corresponding to the new feature vector $\vec{v}$.
Note: For mathematical simplicity, throughout this problem you can assume that $X^T X = I \in \mathbb{R}^{(n+1) \times (n+1)}$, $\tilde{X}^T \tilde{X} = I \in \mathbb{R}^{(n+2) \times (n+2)}$, and $\vec{v}^T \vec{v} = 1$. This is called an orthonormality assumption: specifically, the columns of $\tilde{X}$ are orthonormal. The conclusions of the problem hold even if we do not make this assumption, but this will make your derivations easier.
(a) [2 points] Let $\hat{\theta} = \arg\min_\theta J(\theta)$ be the minimizer of the original least squares objective (using the original design matrix $X$). Using the orthonormality assumption, show that $J(\hat{\theta}) = (XX^T\vec{y} - \vec{y})^T (XX^T\vec{y} - \vec{y})$; i.e., show that this is the value of $\min_\theta J(\theta)$ (the value of the objective at the minimum).

Answer: We know from lecture that the least squares minimizer is $\hat{\theta} = (X^T X)^{-1} X^T \vec{y}$, but because of the orthonormality assumption, this simplifies to $\hat{\theta} = X^T \vec{y}$. Substituting this expression into $J(\theta)$ gives the final expression $J(\hat{\theta}) = (XX^T\vec{y} - \vec{y})^T (XX^T\vec{y} - \vec{y})$.
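As a quick numerical sanity check of this closed form, here is a minimal numpy sketch; the orthonormal design matrix is built with a QR decomposition, and all values are illustrative.

```python
import numpy as np

# Verify theta_hat = X^T y and J(theta_hat) = (X X^T y - y)^T (X X^T y - y)
# under the orthonormality assumption X^T X = I.
rng = np.random.default_rng(0)
m, n = 10, 3

# QR decomposition gives a design matrix with orthonormal columns.
X, _ = np.linalg.qr(rng.standard_normal((m, n + 1)))  # X^T X = I
y = rng.standard_normal(m)

theta_hat = X.T @ y                          # minimizer under orthonormality
J_hat = np.sum((X @ theta_hat - y) ** 2)     # objective at the minimum

r = X @ X.T @ y - y                          # claimed residual vector
assert np.isclose(J_hat, r @ r)
```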
(b) [5 points] Now let $\hat{\theta}_{\text{new}}$ be the minimizer for $\tilde{J}(\theta_{\text{new}}) = (\tilde{X}\theta_{\text{new}} - \vec{y})^T (\tilde{X}\theta_{\text{new}} - \vec{y})$. Find the new minimized objective $\tilde{J}(\hat{\theta}_{\text{new}})$ and write this expression in the form $\tilde{J}(\hat{\theta}_{\text{new}}) = J(\hat{\theta}) + f(X, \vec{v}, \vec{y})$, where $J(\hat{\theta})$ is as derived in part (a) and $f$ is some function of $X$, $\vec{v}$, and $\vec{y}$.
Answer: Just like we had in part (a), the minimizer for the new objective is $\hat{\theta}_{\text{new}} = \tilde{X}^T \vec{y}$. Now we solve for the new minimized objective:

$$\begin{aligned}
\tilde{J}(\hat{\theta}_{\text{new}}) &= (\tilde{X}\hat{\theta}_{\text{new}} - \vec{y})^T (\tilde{X}\hat{\theta}_{\text{new}} - \vec{y}) \\
&= (\tilde{X}\tilde{X}^T\vec{y} - \vec{y})^T (\tilde{X}\tilde{X}^T\vec{y} - \vec{y}) \\
&= ((XX^T + \vec{v}\vec{v}^T)\vec{y} - \vec{y})^T ((XX^T + \vec{v}\vec{v}^T)\vec{y} - \vec{y}) \\
&= ((XX^T\vec{y} - \vec{y}) + \vec{v}\vec{v}^T\vec{y})^T ((XX^T\vec{y} - \vec{y}) + \vec{v}\vec{v}^T\vec{y}) \\
&= (XX^T\vec{y} - \vec{y})^T (XX^T\vec{y} - \vec{y}) + 2(XX^T\vec{y} - \vec{y})^T (\vec{v}\vec{v}^T\vec{y}) + (\vec{v}\vec{v}^T\vec{y})^T (\vec{v}\vec{v}^T\vec{y}) \\
&= J(\hat{\theta}) + 2(XX^T\vec{y} - \vec{y})^T (\vec{v}\vec{v}^T\vec{y}) + (\vec{v}\vec{v}^T\vec{y})^T (\vec{v}\vec{v}^T\vec{y})
\end{aligned}$$
(c) [6 points] Prove that the optimal objective value does not increase upon adding a feature to the design matrix. That is, show $\tilde{J}(\hat{\theta}_{\text{new}}) \leq J(\hat{\theta})$.

Answer: Using the final result of part (b), we can continue simplifying the expression for $\tilde{J}(\hat{\theta}_{\text{new}})$ as follows:

$$\begin{aligned}
\tilde{J}(\hat{\theta}_{\text{new}}) &= J(\hat{\theta}) + 2(XX^T\vec{y} - \vec{y})^T (\vec{v}\vec{v}^T\vec{y}) + (\vec{v}\vec{v}^T\vec{y})^T (\vec{v}\vec{v}^T\vec{y}) \\
&= J(\hat{\theta}) + 2\vec{y}^T X X^T \vec{v}\vec{v}^T\vec{y} - 2\vec{y}^T\vec{v}\vec{v}^T\vec{y} + \vec{y}^T\vec{v}\,\vec{v}^T\vec{v}\,\vec{v}^T\vec{y} \\
&= J(\hat{\theta}) - 2(\vec{v}^T\vec{y})^2 + (\vec{v}^T\vec{y})^2 \\
&= J(\hat{\theta}) - (\vec{v}^T\vec{y})^2 \\
&\leq J(\hat{\theta})
\end{aligned}$$

From the second equality to the third, we use the two facts that $X^T \vec{v} = 0$ and $\vec{v}^T \vec{v} = 1$ (the columns of $\tilde{X}$ are orthonormal).
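Parts (b) and (c) together say that appending an orthonormal feature $\vec{v}$ decreases the minimized objective by exactly $(\vec{v}^T\vec{y})^2$. A small numerical sketch checking this, with illustrative data:

```python
import numpy as np

# Appending an orthonormal feature column v decreases the minimized
# least-squares objective by exactly (v^T y)^2, and hence never increases it.
rng = np.random.default_rng(1)
m, n = 10, 3

Q, _ = np.linalg.qr(rng.standard_normal((m, n + 2)))  # orthonormal columns
X, v = Q[:, :-1], Q[:, -1]        # X^T v = 0 and v^T v = 1 by construction
y = rng.standard_normal(m)

J_old = np.sum((X @ (X.T @ y) - y) ** 2)          # theta_hat = X^T y
X_new = np.column_stack([X, v])
J_new = np.sum((X_new @ (X_new.T @ y) - y) ** 2)  # theta_hat_new = X_new^T y

assert np.isclose(J_new, J_old - (v @ y) ** 2)
assert J_new <= J_old + 1e-12
```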
(d) [3 points] Does the above result show that if we keep increasing the number of
features, we can always get a model that generalizes better than a model with
fewer features? Explain why or why not.
Answer: The result shows that we can either maintain or decrease the minimized
square error objective by adding more features. However, remember that the error
objective is computed only on the training samples and not the true data distribution.
As a result, reducing training error does not guarantee a reduction in error on the
true distribution. In fact, after a certain point adding features will likely lead to
overfitting, increasing our generalization error. Therefore, adding features does not
actually always result in a model that generalizes better.
(a) [7 points] Consider the multinomial event model of Naive Bayes. Our goal in
this problem is to show that this is a linear classifier.
For a given text document x, let c1 , ..., cV indicate the number of times each
word (out of V words) appears in the document. Thus, ci ∈ {0, 1, 2, . . .} counts
the occurrences of word i. Recall that the Naive Bayes model uses parameters
φy = p(y = 1), φi|y=1 = p(word i appears in a specific document position | y =
1) and φi|y=0 = p(word i appears in a specific document position | y = 0).
We say a classifier is linear if it assigns a label $y = 1$ using a decision rule of the form
$$\sum_{i=1}^{V} w_i c_i + b \geq 0.$$
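The derivation itself is not reproduced above, but the resulting rule can be sketched numerically. Assuming the standard identities $w_i = \log(\phi_{i|y=1} / \phi_{i|y=0})$ and $b = \log(\phi_y / (1 - \phi_y))$ (an assumption of this sketch, with made-up parameter values), the linear score reproduces the comparison of joint log-probabilities:

```python
import numpy as np

# Sketch of the linear decision rule implied by multinomial Naive Bayes.
V = 4
phi_y = 0.5
phi_1 = np.array([0.4, 0.3, 0.2, 0.1])   # p(word i | y = 1), sums to 1
phi_0 = np.array([0.1, 0.2, 0.3, 0.4])   # p(word i | y = 0), sums to 1

w = np.log(phi_1 / phi_0)                # assumed weights
b = np.log(phi_y / (1 - phi_y))          # assumed bias

c = np.array([3, 1, 0, 0])               # word counts for one document

# Linear rule: predict y = 1 iff w^T c + b >= 0 ...
linear_score = w @ c + b

# ... which equals the difference of the joint log-probabilities.
log_joint_1 = np.log(phi_y) + c @ np.log(phi_1)
log_joint_0 = np.log(1 - phi_y) + c @ np.log(phi_0)
assert np.isclose(linear_score, log_joint_1 - log_joint_0)
```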
(b) [7 points] In Problem Set 1, you showed that Gaussian Discriminant Analysis
(GDA) is a linear classifier. In this problem, we will show that a modified version
of GDA has a quadratic decision boundary.
Recall that GDA models p(x|y) using a multivariate normal distribution, where
(x|y = 0) ∼ N (µ0 , Σ) and (x|y = 1) ∼ N (µ1 , Σ), where we used the same Σ for
both Gaussians. For this question, we will instead use two covariance matrices
Σ0 , Σ1 for the two labels. So, (x|y = 0) ∼ N (µ0 , Σ0 ) and (x|y = 1) ∼ N (µ1 , Σ1 ).
Let's follow a binary decision rule, where we predict $y = 1$ if $p(y=1|x) \geq p(y=0|x)$, and $y = 0$ otherwise. Show that if $\Sigma_0 \neq \Sigma_1$, then the separating boundary is quadratic in $x$.

That is, simplify the decision rule "$p(y=1|x) \geq p(y=0|x)$" to the form "$x^T A x + B^T x + C \geq 0$" (supposing that $x \in \mathbb{R}^{n+1}$), for some $A \in \mathbb{R}^{(n+1) \times (n+1)}$, $B \in \mathbb{R}^{n+1}$, $C \in \mathbb{R}$, and $A \neq 0$. Please clearly state your values for $A$, $B$, and $C$.
Answer: The decision rule $p(y=1|x) \geq p(y=0|x)$ can be rewritten as follows:

$$\begin{aligned}
0 &\leq \log \frac{p(y=1|x)}{p(y=0|x)} \\
&= \log \frac{p(y=1)\, p(x|y=1)}{p(y=0)\, p(x|y=0)} \\
&= \log \frac{\phi}{1-\phi} - \log \frac{|\Sigma_1|^{1/2}}{|\Sigma_0|^{1/2}} - \frac{1}{2}\left[ (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) - (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) \right] \\
&= -\frac{1}{2} x^T (\Sigma_1^{-1} - \Sigma_0^{-1}) x + (\mu_1^T \Sigma_1^{-1} - \mu_0^T \Sigma_0^{-1}) x - \frac{1}{2}\left( \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_0^T \Sigma_0^{-1} \mu_0 \right) + \log \frac{\phi}{1-\phi} - \log \frac{|\Sigma_1|^{1/2}}{|\Sigma_0|^{1/2}} \\
&= x^T \left[ \tfrac{1}{2}(\Sigma_0^{-1} - \Sigma_1^{-1}) \right] x + (\mu_1^T \Sigma_1^{-1} - \mu_0^T \Sigma_0^{-1}) x + \log \frac{\phi}{1-\phi} + \log \frac{|\Sigma_0|^{1/2}}{|\Sigma_1|^{1/2}} + \frac{1}{2}\left( \mu_0^T \Sigma_0^{-1} \mu_0 - \mu_1^T \Sigma_1^{-1} \mu_1 \right)
\end{aligned}$$

Thus $A = \frac{1}{2}(\Sigma_0^{-1} - \Sigma_1^{-1})$, $B = \Sigma_1^{-1}\mu_1 - \Sigma_0^{-1}\mu_0$, and $C = \log \frac{\phi}{1-\phi} + \frac{1}{2}\log \frac{|\Sigma_0|}{|\Sigma_1|} + \frac{1}{2}(\mu_0^T \Sigma_0^{-1} \mu_0 - \mu_1^T \Sigma_1^{-1} \mu_1)$. Since $\Sigma_0 \neq \Sigma_1$, we have $A \neq 0$, so the boundary is quadratic in $x$.
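A small numerical check of the quadratic rule against direct density evaluation; all parameter values below are made up for illustration.

```python
import numpy as np

# Compare the quadratic rule x^T A x + B^T x + C >= 0 against directly
# comparing phi * N(x; mu1, S1) with (1 - phi) * N(x; mu0, S0).
def gauss_logpdf(x, mu, S):
    d = x - mu
    return -0.5 * (len(x) * np.log(2 * np.pi)
                   + np.log(np.linalg.det(S))
                   + d @ np.linalg.inv(S) @ d)

phi = 0.4
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 1.0])
S0 = np.array([[2.0, 0.3], [0.3, 1.0]])
S1 = np.array([[1.0, -0.2], [-0.2, 1.5]])
S0i, S1i = np.linalg.inv(S0), np.linalg.inv(S1)

A = 0.5 * (S0i - S1i)
B = S1i @ mu1 - S0i @ mu0
C = (np.log(phi / (1 - phi))
     + 0.5 * np.log(np.linalg.det(S0) / np.linalg.det(S1))
     + 0.5 * (mu0 @ S0i @ mu0 - mu1 @ S1i @ mu1))

for x in np.random.default_rng(0).standard_normal((20, 2)):
    quad = x @ A @ x + B @ x + C >= 0
    direct = (np.log(phi) + gauss_logpdf(x, mu1, S1)
              >= np.log(1 - phi) + gauss_logpdf(x, mu0, S0))
    assert quad == direct
```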
$$p(y; \phi) = (1 - \phi)^{y-1} \phi$$

This distribution is known as the geometric distribution, and is used to model network connections and many other problems. Writing it in the exponential family form $p(y; \eta) = b(y) \exp(\eta\, T(y) - a(\eta))$, we have:

$$b(y) = 1, \qquad \eta = \log(1 - \phi), \qquad T(y) = y, \qquad a(\eta) = \log \frac{1 - \phi}{\phi}, \qquad \phi = 1 - e^{\eta}$$
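A quick numerical check, with a made-up value of $\phi$, that this exponential family form reproduces the geometric pmf:

```python
import numpy as np

# Check p(y; phi) = (1 - phi)^(y-1) * phi against b(y) * exp(eta*T(y) - a(eta)).
phi = 0.3
eta = np.log(1 - phi)
a_eta = np.log((1 - phi) / phi)

for y in range(1, 10):
    pmf = (1 - phi) ** (y - 1) * phi
    expfam = np.exp(eta * y - a_eta)   # b(y) = 1, T(y) = y
    assert np.isclose(pmf, expfam)
```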
ii. [5 points] Suppose that we have an IID training set $\{(x^{(i)}, y^{(i)}),\ i = 1, \ldots, m\}$ and we wish to model this using a GLM based on a geometric distribution. Find the log-likelihood $\log \prod_{i=1}^m p(y^{(i)}|x^{(i)}; \theta)$ defined with respect to the entire training set.

Answer: We first calculate the log-likelihood for one sample:

$$\begin{aligned}
\log p(y^{(i)}|x^{(i)}; \theta) &= \log \left[ (1 - \phi)^{y^{(i)} - 1} \phi \right] \\
&= (y^{(i)} - 1)\log(1 - \phi) + \log \phi \\
&= (\log(1 - \phi))\, y^{(i)} - \log \frac{1 - \phi}{\phi} \\
&= y^{(i)} \log e^{\eta} - \log \frac{e^{\eta}}{1 - e^{\eta}} \\
&= \eta y^{(i)} - \eta + \log(1 - e^{\eta}) \\
&= \theta^T x^{(i)} y^{(i)} - \theta^T x^{(i)} + \log(1 - \exp(\theta^T x^{(i)})) \\
&= \theta^T x^{(i)} (y^{(i)} - 1) + \log(1 - \exp(\theta^T x^{(i)}))
\end{aligned}$$

Summing over the training set, the full log-likelihood is

$$\ell(\theta) = \sum_{i=1}^m \left[ \theta^T x^{(i)} (y^{(i)} - 1) + \log(1 - \exp(\theta^T x^{(i)})) \right].$$
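As a sanity check, the simplified per-example log-likelihood should equal the direct geometric log-pmf. A minimal sketch, with made-up $\theta$ and $x$ chosen so that $\theta^T x < 0$ (so $\phi = 1 - \exp(\theta^T x)$ lies in $(0, 1)$):

```python
import numpy as np

# Compare the direct geometric log-pmf to the simplified GLM form.
theta = np.array([-0.5, -1.0])
x = np.array([1.0, 0.5])
y = 3

eta = theta @ x                  # natural parameter, here eta < 0
phi = 1 - np.exp(eta)

direct = (y - 1) * np.log(1 - phi) + np.log(phi)
simplified = eta * (y - 1) + np.log(1 - np.exp(eta))
assert np.isclose(direct, simplified)
```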
(b) [6 points] Derive the Hessian $H$ and the gradient vector of the log-likelihood with respect to $\theta$, and state what one step of Newton's method for maximizing the log-likelihood would be.

Answer: To apply Newton's method, we need to find the gradient and Hessian of the log-likelihood:

$$\begin{aligned}
\nabla_\theta \ell(\theta) &= \nabla_\theta \sum_{i=1}^m \left[ \theta^T x^{(i)} y^{(i)} - \theta^T x^{(i)} + \log(1 - \exp(\theta^T x^{(i)})) \right] \\
&= \sum_{i=1}^m \left[ x^{(i)} (y^{(i)} - 1) - \frac{x^{(i)} \exp(\theta^T x^{(i)})}{1 - \exp(\theta^T x^{(i)})} \right] \\
&= \sum_{i=1}^m \left( y^{(i)} - \frac{1}{1 - \exp(\theta^T x^{(i)})} \right) x^{(i)}
\end{aligned}$$

$$\begin{aligned}
H &= \nabla_\theta (\nabla_\theta \ell(\theta))^T \\
&= -\nabla_\theta \sum_{i=1}^m \frac{1}{1 - \exp(\theta^T x^{(i)})} x^{(i)T} \\
&= -\sum_{i=1}^m \frac{\exp(\theta^T x^{(i)})}{(1 - \exp(\theta^T x^{(i)}))^2} x^{(i)} x^{(i)T}
\end{aligned}$$

The Newton's method update rule is then: $\theta := \theta - H^{-1} \nabla_\theta \ell(\theta)$.
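Below is a minimal numpy sketch that checks the derived gradient by finite differences and performs one Newton update; the synthetic data is chosen so that $\theta^T x^{(i)} < 0$ for all $i$, keeping $\phi^{(i)} \in (0, 1)$.

```python
import numpy as np

# Finite-difference check of the gradient, plus one Newton step, for the
# geometric GLM log-likelihood derived above. Data is synthetic.
rng = np.random.default_rng(2)
m, n = 20, 3
X = -1.0 - np.abs(rng.standard_normal((m, n)))   # strictly negative features
y = rng.integers(1, 6, size=m).astype(float)     # counts y >= 1
theta = np.array([0.5, 0.3, 0.7])                # positive, so theta^T x < 0

def loglik(t):
    eta = X @ t
    return np.sum(eta * (y - 1) + np.log(1 - np.exp(eta)))

def grad(t):
    eta = X @ t
    return X.T @ (y - 1.0 / (1.0 - np.exp(eta)))

def hess(t):
    eta = X @ t
    w = np.exp(eta) / (1.0 - np.exp(eta)) ** 2
    return -(X * w[:, None]).T @ X               # -sum_i w_i x_i x_i^T

# Central-difference check of the gradient.
eps = 1e-6
num_grad = np.array([(loglik(theta + eps * e) - loglik(theta - eps * e)) / (2 * eps)
                     for e in np.eye(n)])
assert np.allclose(grad(theta), num_grad, atol=1e-4)

# One Newton step: theta := theta - H^{-1} grad.
theta_next = theta - np.linalg.solve(hess(theta), grad(theta))
```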
(c) [2 points] Show that the Hessian is negative semi-definite. This shows that the optimization objective is concave, and hence that Newton's method is maximizing the log-likelihood.

Answer: For any vector $z$,

$$\begin{aligned}
z^T H z &= -\sum_{i=1}^m \frac{\exp(\theta^T x^{(i)})}{(1 - \exp(\theta^T x^{(i)}))^2} z^T x^{(i)} x^{(i)T} z \\
&= -\sum_{i=1}^m \frac{\exp(\theta^T x^{(i)})}{(1 - \exp(\theta^T x^{(i)}))^2} (z^T x^{(i)})^2 \leq 0.
\end{aligned}$$
$$\begin{aligned}
\min_{w,b} \quad & \tfrac{1}{2}\|w\|^2 \\
\text{s.t.} \quad & y^{(i)} - w^T x^{(i)} - b \leq \epsilon, \quad i = 1, \ldots, m \quad (1) \\
& w^T x^{(i)} + b - y^{(i)} \leq \epsilon, \quad i = 1, \ldots, m \quad (2)
\end{aligned}$$

where $\epsilon > 0$ is a given, fixed value. Notice how the original functional margin constraint has been modified to now represent the distance between the continuous $y$ and our hypothesis' output.
(a) [4 points] Write down the Lagrangian for the optimization problem above. We
suggest you use two sets of Lagrange multipliers αi and αi∗ , corresponding to the
two inequality constraints (labeled (1) and (2) above), so that the Lagrangian
would be written L(w, b, α, α∗ ).
Answer: Let $\alpha_i, \alpha_i^* \geq 0$ ($i = 1, \ldots, m$) be the Lagrange multipliers for constraints (1) and (2) respectively. Then the Lagrangian can be written as:

$$L(w, b, \alpha, \alpha^*) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^m \alpha_i \left( \epsilon - y^{(i)} + w^T x^{(i)} + b \right) - \sum_{i=1}^m \alpha_i^* \left( \epsilon + y^{(i)} - w^T x^{(i)} - b \right)$$
(b) [10 points] Derive the dual optimization problem. You will have to take derivatives of the Lagrangian with respect to $w$ and $b$.

Answer: First, the dual objective function can be written as $\theta_D(\alpha, \alpha^*) = \min_{w,b} L(w, b, \alpha, \alpha^*)$. Taking the derivatives of the Lagrangian with respect to the primal variables, we have:

$$\partial_w L = w - \sum_{i=1}^m (\alpha_i - \alpha_i^*) x^{(i)} = 0 \quad \Longrightarrow \quad w = \sum_{i=1}^m (\alpha_i - \alpha_i^*) x^{(i)}$$

$$\partial_b L = \sum_{i=1}^m (\alpha_i^* - \alpha_i) = 0$$

Substituting the above two relations back into the Lagrangian, we have:

$$\begin{aligned}
\theta_D(\alpha, \alpha^*) &= \tfrac{1}{2}\|w\|^2 - \epsilon \sum_{i=1}^m (\alpha_i + \alpha_i^*) + \sum_{i=1}^m y^{(i)} (\alpha_i - \alpha_i^*) + b \sum_{i=1}^m (\alpha_i^* - \alpha_i) + \sum_{i=1}^m (\alpha_i^* - \alpha_i) w^T x^{(i)} \\
&= \tfrac{1}{2} \Big\| \sum_{i=1}^m (\alpha_i - \alpha_i^*) x^{(i)} \Big\|^2 - \sum_{i=1}^m \sum_{j=1}^m (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) x^{(j)T} x^{(i)} - \epsilon \sum_{i=1}^m (\alpha_i + \alpha_i^*) + \sum_{i=1}^m y^{(i)} (\alpha_i - \alpha_i^*) \\
&= -\tfrac{1}{2} \sum_{i=1}^m \sum_{j=1}^m (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) x^{(i)T} x^{(j)} - \epsilon \sum_{i=1}^m (\alpha_i + \alpha_i^*) + \sum_{i=1}^m y^{(i)} (\alpha_i - \alpha_i^*)
\end{aligned}$$

where the $b$ term vanishes because $\sum_{i=1}^m (\alpha_i^* - \alpha_i) = 0$. The dual optimization problem is then to maximize $\theta_D(\alpha, \alpha^*)$ subject to $\alpha_i \geq 0$, $\alpha_i^* \geq 0$, and $\sum_{i=1}^m (\alpha_i^* - \alpha_i) = 0$.
(c) [4 points] Show that this algorithm can be kernelized. For this, you have to show that (i) the dual optimization objective can be written in terms of inner products of training examples; and (ii) at test time, given a new $x$ the hypothesis $h_{w,b}(x)$ can also be computed in terms of inner products.

Answer: From the final expression in part (b), the dual objective depends on the training inputs only through the inner products $x^{(i)T} x^{(j)}$, so these can be replaced by kernel evaluations $k(x^{(i)}, x^{(j)})$. Likewise, when making a prediction at $x$, we have:

$$h_{w,b}(x) = w^T x + b = \sum_{i=1}^m (\alpha_i - \alpha_i^*) x^{(i)T} x + b = \sum_{i=1}^m (\alpha_i - \alpha_i^*) k(x^{(i)}, x) + b$$
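A minimal sketch of this kernelized prediction; the dual variables below are made-up stand-ins for the output of a dual solver, and the RBF kernel is just one valid choice.

```python
import numpy as np

# Kernelized hypothesis h(x) = sum_i (alpha_i - alpha_i^*) k(x(i), x) + b.
def rbf_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def predict(x, X_train, alpha, alpha_star, b, kernel=rbf_kernel):
    return sum((a - a_s) * kernel(xi, x)
               for a, a_s, xi in zip(alpha, alpha_star, X_train)) + b

# Illustrative values only; alpha, alpha_star would come from the dual problem.
X_train = np.array([[0.0], [1.0], [2.0]])
alpha = np.array([0.5, 0.0, 0.2])
alpha_star = np.array([0.0, 0.3, 0.0])
print(predict(np.array([1.5]), X_train, alpha, alpha_star, b=0.1))
```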
Let $\delta > 0$ be some fixed constant, and consider a finite hypothesis class $\mathcal{H}$ of size $|\mathcal{H}| = k$. For each $h \in \mathcal{H}$, let $\hat{\varepsilon}(h)$ denote the training error of $h$ with respect to some training set of $m$ IID examples, and let $\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{\varepsilon}(h)$ denote the hypothesis that minimizes training error. Similarly, let $\varepsilon(h)$ denote the generalization error of $h$, and let $h^* = \arg\min_{h \in \mathcal{H}} \varepsilon(h)$. Given a fixed hypothesis $h_0 \in \mathcal{H}$ and a fixed $\eta > 0$, we wish to decide whether $h_0$ is $\eta$-optimal, i.e., whether $\varepsilon(h_0) \leq \varepsilon(h^*) + \eta$.

Now, consider the following algorithm:

1. Set
$$\gamma := \sqrt{\frac{1}{2m} \log \frac{2k}{\delta}}.$$
2. If $\hat{\varepsilon}(h_0) > \hat{\varepsilon}(\hat{h}) + \eta + 2\gamma$, return NO.
3. If $\hat{\varepsilon}(h_0) < \hat{\varepsilon}(\hat{h}) + \eta - 2\gamma$, return YES.
Intuitively, the algorithm works by comparing the training error of h0 to the training
error of the hypothesis ĥ with the minimum training error, and returns NO or YES
only when ε̂(h0 ) is either significantly larger than or significantly smaller than ε̂(ĥ)+η.
(a) [6 points] First, show that if ε(h0 ) ≤ ε(h∗ ) + η (i.e., h0 is η-optimal), then the
probability that the algorithm returns NO is at most δ.
Answer: Suppose that $\varepsilon(h_0) \leq \varepsilon(h^*) + \eta$. Then, with probability at least $1 - \delta$,

$$\begin{aligned}
\hat{\varepsilon}(h_0) &\leq \varepsilon(h_0) + \gamma \\
&\leq \varepsilon(h^*) + \eta + \gamma \\
&\leq \varepsilon(\hat{h}) + \eta + \gamma \\
&\leq \hat{\varepsilon}(\hat{h}) + \eta + 2\gamma.
\end{aligned}$$
Here, the first and last inequalities follow from the fact that under the stated uniform
convergence conditions, all hypotheses in H have empirical errors within γ of their
true generalization errors. The second inequality follows from our assumption, and
the third inequality follows from the fact that h∗ minimizes the true generalization
error. Therefore, the reverse condition, ε̂(h0 ) > ε̂(ĥ)+η+2γ, occurs with probability
at most δ.
(b) [6 points] Second, show that if ε(h0 ) > ε(h∗ ) + η (i.e., h0 is not η-optimal), then
the probability that the algorithm returns YES is at most δ.
Answer:
Suppose that $\varepsilon(h_0) > \varepsilon(h^*) + \eta$. Using the Hoeffding inequality, we have that for
$$\gamma = \sqrt{\frac{1}{2m} \log \frac{2k}{\delta}},$$
with probability at least $1 - \delta$,

$$\begin{aligned}
\hat{\varepsilon}(h_0) &\geq \varepsilon(h_0) - \gamma \\
&> \varepsilon(h^*) + \eta - \gamma \\
&\geq \hat{\varepsilon}(h^*) + \eta - 2\gamma \\
&\geq \hat{\varepsilon}(\hat{h}) + \eta - 2\gamma.
\end{aligned}$$
Here, the first and third inequalities follow from the fact that under the stated
uniform convergence conditions, all hypotheses in H have empirical errors within γ of
their true generalization errors. The second inequality follows from our assumption,
and the last inequality follows from the fact that ĥ minimizes the empirical error.
Therefore, the reverse condition, ε̂(h0 ) < ε̂(ĥ) + η − 2γ occurs with probability at
most δ.
(c) [8 points] Finally, suppose that h0 = h∗ , and let η > 0 and δ > 0 be fixed. Show
that if m is sufficiently large, then the probability that the algorithm returns
YES is at least 1 − δ.
Answer: Under the uniform convergence event, which holds with probability at least $1 - \delta$,

$$\begin{aligned}
\hat{\varepsilon}(h_0) &\leq \varepsilon(h_0) + \gamma \\
&= \varepsilon(h^*) + \gamma \\
&\leq \varepsilon(\hat{h}) + \gamma \\
&\leq \hat{\varepsilon}(\hat{h}) + 2\gamma.
\end{aligned}$$
Here, the first and last inequalities follow from the fact that under the stated uniform
convergence conditions, all hypotheses in H have empirical errors within γ of their
true generalization errors. The equality in the second step follows from our assump-
tion, and the inequality in the third step follows from the fact that h∗ minimizes the
true generalization error. But, observe that for fixed η and δ, as m → ∞, we have
$$\gamma = \sqrt{\frac{1}{2m} \log \frac{2k}{\delta}} \to 0.$$

This implies that for $m$ sufficiently large, $4\gamma < \eta$, or equivalently, $2\gamma < \eta - 2\gamma$. It follows that with probability at least $1 - \delta$, if $m$ is sufficiently large, then
$$\hat{\varepsilon}(h_0) \leq \hat{\varepsilon}(\hat{h}) + 2\gamma < \hat{\varepsilon}(\hat{h}) + \eta - 2\gamma,$$
and so the algorithm returns YES.
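As a concrete illustration of this sample-size requirement: $4\gamma < \eta$ is equivalent to $m > 8\log(2k/\delta)/\eta^2$. A small computation with made-up values of $k$, $\delta$, and $\eta$:

```python
import math

# Smallest m guaranteeing 4*gamma < eta, i.e. m > 8*log(2k/delta)/eta^2.
k, delta, eta = 100, 0.05, 0.1
m = math.ceil(8 * math.log(2 * k / delta) / eta ** 2) + 1

gamma = math.sqrt(math.log(2 * k / delta) / (2 * m))
assert 4 * gamma < eta
print(m, gamma)
```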
(a) [3 points] You have an implementation of Newton's method and gradient descent. Suppose that one iteration of Newton's method takes twice as long as one iteration of gradient descent. Then, this implies that gradient descent will converge to the optimal objective faster. True/False?

Answer: False. Newton's method typically requires far fewer iterations to converge (it enjoys quadratic convergence near the optimum), so even at twice the cost per iteration it can reach the optimum sooner than gradient descent.
(b) [3 points] A stochastic gradient descent algorithm for training logistic regression with a fixed learning rate will always converge to exactly the optimal setting of the parameters $\theta^* = \arg\max_\theta \prod_{i=1}^m p(y^{(i)}|x^{(i)}; \theta)$, assuming a reasonable choice of the learning rate. True/False?
Answer: False. A fixed learning rate means that we are always taking a finite step towards improving the log-probability of any single training example in the update equation. Unless the examples are somehow aligned, we will keep jumping from one side of the optimal solution to the other, and will not be able to get arbitrarily close to it. The learning rate has to approach zero over the course of the updates for the weights to converge robustly.
(c) [3 points] Given a valid kernel $K(x, y)$ over $\mathbb{R}^m$, is
$$K_{\text{norm}}(x, y) = \frac{K(x, y)}{\sqrt{K(x, x)\, K(y, y)}}$$
a valid kernel?

Answer: Yes. Since $K$ is valid, it can be written as $K(x, y) = \Phi(x)^T \Phi(y)$ for some feature map $\Phi$. Then

$$K_{\text{norm}}(x, y) = \frac{\Phi(x)^T \Phi(y)}{\sqrt{(\Phi(x)^T \Phi(x))(\Phi(y)^T \Phi(y))}} = \Psi(x)^T \Psi(y), \qquad \text{with} \qquad \Psi(x) = \frac{\Phi(x)}{\sqrt{\Phi(x)^T \Phi(x)}}.$$

So $K_{\text{norm}}$ (as normalized) is a valid kernel.
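A numerical sanity check of this fact: the normalized Gram matrix of a valid kernel should remain positive semi-definite. The polynomial kernel and data below are illustrative choices.

```python
import numpy as np

# Normalize the Gram matrix of a polynomial kernel and check it stays PSD
# with unit diagonal.
rng = np.random.default_rng(3)
X = rng.standard_normal((8, 2))

K = (X @ X.T + 1.0) ** 2                 # (x^T y + 1)^2, a valid kernel
d = np.sqrt(np.diag(K))
K_norm = K / np.outer(d, d)              # K(x,y) / sqrt(K(x,x) K(y,y))

assert np.all(np.linalg.eigvalsh(K_norm) >= -1e-10)   # PSD up to roundoff
assert np.allclose(np.diag(K_norm), 1.0)
```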
Answer: No, we cannot, since the decision boundary is linear. Let the labels be represented by $y$. Observe that we cannot classify the four points correctly in the case $y^{(1)} = +1$, $y^{(2)} = y^{(3)} = -1$, $y^{(4)} = +1$.
(e) [3 points] For linear hypotheses (i.e. of the form h(x) = wT x + b), the vec-
tor of learned weights w is always perpendicular to the separating hyperplane.
True/False? Provide a counterexample if False, or a brief explanation if True.
Answer: True. For a linear separating boundary, the hyperplane is defined by the set $\{x \mid w^T x = -b\}$. The inner product $w^T x$ geometrically represents the (scaled) projection of $x$ onto $w$. The set of all points whose projection onto $w$ is constant ($-b$) forms a hyperplane that must be perpendicular to $w$. This fact is needed in the formulation of geometric margins for the linear SVM.
(g) [6 points] Suppose you would like to use a linear regression model in order to predict the price of houses. In your model, you use the features $x_0 = 1$, $x_1$ = size in square meters, $x_2$ = height of roof in meters. Now, suppose a friend repeats the same analysis using exactly the same training set, only he represents the data instead using features $x_0' = 1$, $x_1' = x_1$, and $x_2'$ = height in cm (so $x_2' = 100 x_2$).

i. [3 points] Suppose both of you run linear regression, solving for the parameters via the Normal equations. (Assume there are no degeneracies, so this gives a unique solution to the parameters.) You get parameters $\theta_0$, $\theta_1$, $\theta_2$; your friend gets $\theta_0'$, $\theta_1'$, $\theta_2'$. Then $\theta_0' = \theta_0$, $\theta_1' = \theta_1$, $\theta_2' = \frac{1}{100}\theta_2$. True/False?
Answer: True. Observe that running a single step of Newton's method, for a linear regression problem, is equivalent to solving the Normal equations. The result then follows from the invariance of Newton's method to linear reparameterizations.
ii. [3 points] Suppose both of you run linear regression, initializing the parameters to 0, and compare your results after running just one iteration of batch gradient descent. You get parameters $\theta_0$, $\theta_1$, $\theta_2$; your friend gets $\theta_0'$, $\theta_1'$, $\theta_2'$. Then $\theta_0' = \theta_0$, $\theta_1' = \theta_1$, $\theta_2' = \frac{1}{100}\theta_2$. True/False?

Answer: False. Recall that gradient descent is not invariant to linear reparameterizations.
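A short numerical illustration of both parts, using synthetic house data and an illustrative learning rate: the Normal-equations solution rescales exactly as claimed, while a single batch gradient descent step from zero does not.

```python
import numpy as np

# Part i: rescaling feature x2 by 100 rescales theta_2 by 1/100 under the
# Normal equations. Part ii: one batch GD step from theta = 0 does not.
rng = np.random.default_rng(4)
m = 20
X = np.column_stack([np.ones(m),
                     rng.uniform(50, 200, m),    # size in square meters
                     rng.uniform(2, 5, m)])      # roof height in meters
y = X @ np.array([10.0, 3.0, 7.0]) + rng.standard_normal(m)

Xp = X.copy()
Xp[:, 2] *= 100.0                                # friend measures cm

scale = np.array([1.0, 1.0, 100.0])

theta = np.linalg.solve(X.T @ X, X.T @ y)        # Normal equations
theta_p = np.linalg.solve(Xp.T @ Xp, Xp.T @ y)
assert np.allclose(theta_p, theta / scale)       # part i: True

alpha = 1e-8                                     # illustrative learning rate
grad_at_zero = lambda M: M.T @ (M @ np.zeros(3) - y)
step, step_p = -alpha * grad_at_zero(X), -alpha * grad_at_zero(Xp)
assert not np.allclose(step_p, step / scale)     # part ii: False
```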