
Assignment 5

Introduction to Machine Learning


Prof. B. Ravindran
1. Consider a feedforward neural network that performs regression on a p-dimensional input to
produce a scalar output. It has m hidden layers and each of these layers has k hidden units.
What is the total number of trainable parameters in the network? Ignore the bias terms.

(a) pk + mk^2
(b) pk + mk^2 + k
(c) pk + (m − 1)k^2 + k
(d) p^2 + (m − 1)pk + k
(e) p^2 + (m − 1)pk + k^2

Sol. (c)
Number of edges between the input layer and the first hidden layer = pk.
Number of edges between the i-th and (i + 1)-th hidden layers = k^2. Taking i = 1, 2, ..., m − 1,
we get (m − 1)k^2 edges.
Since the output is a scalar, there is only one neuron in the output layer. Therefore, the
number of edges between the last (i.e. m-th) hidden layer and the output layer = k.
Hence, the total number of edges = pk + (m − 1)k^2 + k. Each of these edges corresponds to a
parameter.
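This count can also be checked programmatically. Below is a minimal Python sketch; the values p = 10, m = 3, k = 5 are illustrative assumptions, not part of the question.

```python
# Sketch: count the weights of a bias-free feedforward network layer by layer.
p, m, k = 10, 3, 5  # illustrative values (not from the question)

# Layer widths: input -> m hidden layers of width k -> scalar output.
widths = [p] + [k] * m + [1]

# Each consecutive pair of layers contributes fan_in * fan_out weights (no biases).
total = sum(a * b for a, b in zip(widths[:-1], widths[1:]))

assert total == p * k + (m - 1) * k**2 + k  # matches option (c)
print(total)  # 105 for these values
```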
2. Consider a neural network layer defined as y = ReLU(W x). Here x ∈ R^p is the input,
y ∈ R^d is the output and W ∈ R^{d×p} is the parameter matrix. The ReLU activation (defined
as ReLU(z) := max(0, z) for a scalar z) is applied element-wise to W x. Find ∂y_i/∂W_ij, where
i = 1, ..., d and j = 1, ..., p. In the following options, I(condition) is an indicator function that
returns 1 if the condition is true and 0 if it is false.

(a) I(y_i > 0) x_i
(b) I(y_i > 0) x_j
(c) I(y_i ≤ 0) x_i
(d) I(y_i > 0) W_ij x_j
(e) I(y_i ≤ 0) W_ij x_i

Sol. (b)
We have y_i = max(Σ_{j=1}^p W_ij x_j, 0).
Σ_{j=1}^p W_ij x_j ≤ 0 =⇒ y_i = 0 =⇒ ∂y_i/∂W_ij = 0.
Σ_{j=1}^p W_ij x_j > 0 =⇒ y_i = Σ_{j=1}^p W_ij x_j =⇒ ∂y_i/∂W_ij = x_j.
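As a sanity check, the expression I(y_i > 0) x_j can be compared against a finite-difference estimate of the gradient; a small numpy sketch (with arbitrary illustrative dimensions) is given below.

```python
import numpy as np

# Sketch: verify dy_i/dW_ij = I(y_i > 0) * x_j with a central finite difference.
rng = np.random.default_rng(0)
d, p = 3, 4                      # illustrative dimensions
x = rng.normal(size=p)
W = rng.normal(size=(d, p))

y = np.maximum(W @ x, 0.0)                                 # y = ReLU(Wx)
analytic = (y > 0).astype(float)[:, None] * x[None, :]     # I(y_i > 0) * x_j

eps = 1e-6
numeric = np.zeros_like(W)
for i in range(d):
    for j in range(p):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (np.maximum(Wp @ x, 0.0)[i] - np.maximum(Wm @ x, 0.0)[i]) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))  # True (away from the kink at 0)
```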

3. Consider a two-layered neural network y = σ(W^(B) σ(W^(A) x)). Let h = σ(W^(A) x) denote the
hidden layer representation. W^(A) and W^(B) are arbitrary weights. Which of the following
statement(s) is/are true? Note: ∇_g(f) denotes the gradient of f w.r.t. g.

(a) ∇_h(y) depends on W^(A).
(b) ∇_{W^(A)}(y) depends on W^(B).
(c) ∇_{W^(A)}(h) depends on W^(B).
(d) ∇_{W^(B)}(y) depends on W^(A).
Sol. (b), (d)
Since y = σ(W^(B) h), we only require W^(B) and not W^(A) to compute y from h. Hence, ∇_h(y) does not depend on W^(A), so (a) is false.
∇_{W^(A)}(y) can be decomposed into ∇_h(y) and ∇_{W^(A)}(h) using the chain rule. Of these two
components, ∇_h(y) depends on W^(B).
Since h = σ(W^(A) x), we only require W^(A) and not W^(B) to compute h from x, so (c) is false.
∇_{W^(B)}(y) depends on h, which in turn depends on W^(A). Hence, ∇_{W^(B)}(y) depends on W^(A).
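These dependencies can also be checked numerically. The sketch below (illustrative sizes, sigmoid activation, finite differences) shows that the gradient of y with respect to W^(A) changes when only W^(B) is changed, consistent with option (b).

```python
import numpy as np

# Sketch: gradient of the scalar output y w.r.t. W_A, via finite differences,
# evaluated for two different choices of W_B. The results differ, so
# grad_{W_A}(y) depends on W_B.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_y_wrt_WA(W_A, W_B, x, eps=1e-6):
    g = np.zeros_like(W_A)
    for idx in np.ndindex(*W_A.shape):
        Wp, Wm = W_A.copy(), W_A.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        yp = sigmoid(W_B @ sigmoid(Wp @ x))[0]
        ym = sigmoid(W_B @ sigmoid(Wm @ x))[0]
        g[idx] = (yp - ym) / (2 * eps)
    return g

rng = np.random.default_rng(1)
x = rng.normal(size=3)
W_A = rng.normal(size=(4, 3))
W_B1 = rng.normal(size=(1, 4))
W_B2 = rng.normal(size=(1, 4))

print(np.allclose(grad_y_wrt_WA(W_A, W_B1, x), grad_y_wrt_WA(W_A, W_B2, x)))  # False
```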
4. Which of the following statement(s) about the initialization of neural network weights is/are
true?
(a) Two different initializations of the same network could converge to different minima.
(b) For a given initialization, gradient descent will converge to the same minimum irrespective
of the learning rate.
(c) The weights should be initialized to a constant value.
(d) The initial values of the weights should be sampled from a probability distribution.
Sol. (a), (d)
Since the loss surface of a neural network is highly non-convex, it has multiple local minima.
Hence, different initializations or learning rates may result in convergence to different minima.
If the weights are initialized to a constant value, all the neurons in a layer will learn similar
features. To avoid this, the initial values should be sampled from a distribution.
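A small numpy sketch of the symmetry problem (illustrative sizes; the tanh activation is an assumption made for illustration): with a constant initialization every hidden unit computes the same value, so all units would also receive identical gradients and never differentiate.

```python
import numpy as np

# Sketch: constant initialization makes every hidden unit identical (symmetry),
# while sampling the weights from a distribution breaks the symmetry.
rng = np.random.default_rng(0)
x = rng.normal(size=5)

W_const = np.full((4, 5), 0.1)                  # every row (neuron) is identical
W_sampled = rng.normal(scale=0.1, size=(4, 5))  # rows differ

h_const = np.tanh(W_const @ x)
h_sampled = np.tanh(W_sampled @ x)

print(np.allclose(h_const, h_const[0]))      # True  -> all units output the same value
print(np.allclose(h_sampled, h_sampled[0]))  # False -> units output different values
```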
5. Consider the following statements about the derivatives of the sigmoid (σ(x) = 1/(1 + exp(−x)))
and tanh (tanh(x) = (exp(x) − exp(−x))/(exp(x) + exp(−x))) activation functions. Which of these
statement(s) is/are correct?
(a) 0 < σ′(x) ≤ 1/8
(b) lim_{x→−∞} σ′(x) = 0
(c) 0 < tanh′(x) ≤ 1
(d) lim_{x→+∞} tanh′(x) = 1
Sol. (b), (c)
σ′(x) = σ(x)(1 − σ(x))
As x → −∞, we have σ(x) → 0 and (1 − σ(x)) → 1, which implies σ′(x) → 0.
0 < σ(x) < 1 =⇒ σ(x)(1 − σ(x)) > 0. The maximum value of σ′(x) is attained at x = 0, since
σ′(0) = (1/2)(1 − 1/2) = 1/4.

tanh′(x) = 1 − (tanh(x))^2

As x → +∞, we have tanh(x) → 1 =⇒ tanh′(x) → 0.
−1 < tanh(x) < 1 =⇒ 0 < 1 − (tanh(x))^2 ≤ 1. The maximum value of tanh′(x) is attained
at x = 0, since tanh′(0) = 1 − 0^2 = 1.
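A quick numerical check of these bounds (a sketch, simply evaluating both derivatives on a grid of points):

```python
import numpy as np

# Sketch: evaluate sigma'(x) and tanh'(x) on a grid and inspect their range.
x = np.linspace(-10, 10, 10001)
sigma = 1.0 / (1.0 + np.exp(-x))
d_sigma = sigma * (1.0 - sigma)
d_tanh = 1.0 - np.tanh(x) ** 2

print(d_sigma.max())   # ~0.25, attained at x = 0, so the 1/8 bound in (a) is too small
print(d_sigma[0])      # ~0 at x = -10, consistent with (b)
print(d_tanh.max())    # ~1, attained at x = 0, consistent with (c)
print(d_tanh[-1])      # ~0 at x = +10, so the limit in (d) is 0, not 1
```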

6. A geometric distribution is defined by the p.m.f. f(x; p) = (1 − p)^(x−1) p for x = 1, 2, ....
Given the samples [4, 5, 6, 5, 4, 3] drawn from this distribution, find the MLE of p. Using this
estimate, find the probability of sampling x ≥ 5 from the distribution.

(a) 0.289
(b) 0.325
(c) 0.417
(d) 0.366

Sol. (d)
The likelihood function is L(p | x_1, ..., x_n) = Π_{i=1}^n (1 − p)^(x_i − 1) p = p^n (1 − p)^(Σ x_i − n).
log L(p | x_1, ..., x_n) = n log(p) + (Σ x_i − n) log(1 − p)
Take the derivative of the RHS w.r.t. p and equate it to 0. On simplifying, we get p̂_ML = n / Σ x_i.
Substituting the given values, p̂_ML = 6 / (4 + 5 + 6 + 5 + 4 + 3) = 6/27 = 0.222.
P(x ≥ 5) = Σ_{i=5}^∞ (1 − p)^(i−1) p = (1 − p)^4
Substituting p̂_ML, P(x ≥ 5) = (1 − 0.222)^4 = 0.366
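The same computation in a few lines of Python (a sketch using only the samples given above):

```python
# Sketch: MLE of p for the geometric distribution and P(x >= 5) under that estimate.
samples = [4, 5, 6, 5, 4, 3]

p_mle = len(samples) / sum(samples)   # n / sum(x_i) = 6 / 27
prob_x_ge_5 = (1 - p_mle) ** 4        # P(x >= 5) = (1 - p)^4

print(round(p_mle, 3), round(prob_x_ge_5, 3))  # 0.222 0.366
```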
7. Consider a Bernoulli distribution with p = 0.7 (true value of the parameter). We draw
samples from this distribution and compute an MAP estimate of p by assuming a prior distri-
bution over p. Which of the following statement(s) is/are true?

(a) If the prior is Beta(2, 6), we will likely require fewer samples for converging to the true
value than if the prior is Beta(6, 2).
(b) If the prior is Beta(6, 2), we will likely require fewer samples for converging to the true
value than if the prior is Beta(2, 6).
(c) With a prior of Beta(2, 100), the estimate will never converge to the true value, regardless
of the number of samples used.
(d) With a prior of U (0, 0.5) (i.e. uniform distribution between 0 and 0.5), the estimate will
never converge to the true value, regardless of the number of samples used.

Sol. (b), (d)


Beta(6, 2) has a much higher density than Beta(2, 6) near the true value 0.7. Thus, Beta(6, 2)
will likely require fewer samples for convergence.
Beta(2, 100) has a high density close to 0 but has a non-zero density at 0.7. Therefore, the
estimate will converge to the true value if we use a sufficiently large number of samples.
However, U (0, 0.5) has zero density at 0.7. Hence, the estimate will never converge to 0.7.
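The effect of the prior can be illustrated with the closed-form Beta-Bernoulli MAP estimate, (s + α − 1)/(n + α + β − 2) for s successes in n samples. The sketch below simulates samples from the true distribution (an assumption made purely for illustration) and shows the Beta(6, 2) prior starting closer to 0.7, while the MAP under a U(0, 0.5) prior can never exceed 0.5.

```python
import numpy as np

# Sketch: MAP estimates of p under different priors as the sample size grows.
rng = np.random.default_rng(0)
samples = rng.binomial(1, 0.7, size=1000)   # simulated draws from Bernoulli(0.7)

def beta_map(xs, alpha, beta):
    # Mode of the Beta posterior: (s + alpha - 1) / (n + alpha + beta - 2)
    s, n = xs.sum(), len(xs)
    return (s + alpha - 1) / (n + alpha + beta - 2)

for n in (10, 100, 1000):
    xs = samples[:n]
    mle = xs.mean()                          # MLE, i.e. MAP under a flat prior
    print(n,
          round(beta_map(xs, 6, 2), 3),      # Beta(6, 2): prior concentrated near 0.75
          round(beta_map(xs, 2, 6), 3),      # Beta(2, 6): prior concentrated near 0.25
          round(min(mle, 0.5), 3))           # U(0, 0.5): posterior truncated, MAP <= 0.5
```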
8. Which of the following statement(s) about parameter estimation techniques is/are true?

(a) To obtain a distribution over the predicted values for a new data point, we need to
compute an integral over the parameter space.
(b) The MAP estimate of the parameter gives a point prediction for a new data point.
(c) The MLE of a parameter gives a distribution of predicted values for a new data point.
(d) We need a point estimate of the parameter to compute a distribution of the predicted
values for a new data point.

Sol. (a), (b)

Option (a) is true and option (d) is false based on the equation written at 01:10 in the
"Parameter Estimation III" lecture.
Both MAP and MLE give point estimates for the parameter, leading to a point prediction on
a new data point. Hence, (b) is true and (c) is false.
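For concreteness, here is a small sketch of option (a) in a Beta-Bernoulli setting; the prior, the counts, and the closed-form expression (α + s)/(α + β + n) are assumptions used purely for illustration. The predictive probability of a new observation is obtained by integrating the likelihood against the posterior over the parameter.

```python
import numpy as np
from scipy.stats import beta

# Sketch: predictive distribution P(x_new = 1 | D) = integral of p * posterior(p | D) dp
# for Bernoulli data with a Beta(a, b) prior; it matches the closed form (a + s)/(a + b + n).
a, b = 2.0, 2.0      # illustrative prior
n, s = 20, 14        # illustrative data: 14 successes in 20 trials

grid = np.linspace(0.0, 1.0, 100001)
posterior = beta.pdf(grid, a + s, b + n - s)                         # Beta(a + s, b + n - s)
predictive_numeric = np.sum(grid * posterior) * (grid[1] - grid[0])  # Riemann sum over p
predictive_closed = (a + s) / (a + b + n)

print(round(predictive_numeric, 4), round(predictive_closed, 4))  # both ~0.6667
```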
