Solution Dseclzg524!01!102020 Ec2r
SOLUTION
Consider the following DNN for image classification on a dataset of 32x32 RGB images.
model = models.Sequential()
# Layer 1
model.add(layers.Dense(50, activation='relu', input_shape=**A**))
# Layer 2
model.add(layers.Dense(40, activation='relu'))
# Layer 3
model.add(layers.Dense(30, activation='relu'))
# Layer 4
model.add(layers.Dense(**B**, activation=**C**))
model.compile(optimizer='sgd', loss=**D**, metrics=['accuracy'])
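A completed version of the model is sketched below. The values filled in for A-D are assumptions consistent with the stated 32x32 RGB input (flattened to 32*32*3 = 3072 features for the Dense layer); the number of classes (10 here), the softmax output activation, and the categorical cross-entropy loss are illustrative choices, since the question does not state the number of classes.

from tensorflow.keras import layers, models

model = models.Sequential()
# Layer 1: A = (3072,) because a Dense layer needs the 32x32x3 image flattened
model.add(layers.Dense(50, activation='relu', input_shape=(32 * 32 * 3,)))
# Layer 2
model.add(layers.Dense(40, activation='relu'))
# Layer 3
model.add(layers.Dense(30, activation='relu'))
# Layer 4: B = assumed 10 classes, C = 'softmax' for multi-class output
model.add(layers.Dense(10, activation='softmax'))
# D = 'categorical_crossentropy' (with one-hot encoded labels)
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])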
D. What is the difference between kernel regularizers and activity regularizers?
kernel_regularizer: Regularizer to apply a penalty on the layer's kernel.
activity_regularizer: Regularizer to apply a penalty on the layer's output.
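As a short illustration, both regularizers attach to the same Dense layer but penalize different quantities (the l2/l1 factors below are arbitrary example values):

from tensorflow.keras import layers, regularizers

dense = layers.Dense(
    40,
    activation='relu',
    # penalty on the weight matrix (kernel): adds 1e-4 * sum(W**2) to the loss
    kernel_regularizer=regularizers.l2(1e-4),
    # penalty on the layer's output activations: adds 1e-5 * sum(|outputs|) to the loss
    activity_regularizer=regularizers.l1(1e-5),
)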
A. Calculate the actual output (y1, y2) for the current iteration with input (x1, x2) = (1, 1).
y1 = 1/(1+e), y2 = e/(1+e)
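As a numeric check, these values (y1 ≈ 0.269, y2 ≈ 0.731) are what a softmax over logits (z1, z2) = (0, 1), or equivalently sigmoid outputs at -1 and +1, would produce; the logit values used here are an assumption for illustration:

import numpy as np

def softmax(z):
    # Numerically stable softmax.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Assumed logits (z1, z2) = (0, 1) for illustration only.
print(softmax(np.array([0.0, 1.0])))   # [0.2689 0.7311] == [1/(1+e), e/(1+e)]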
B. With the (z1, z2) calculated in A., use the following computation graph to calculate the loss L and the partial derivatives ∂L/∂c, ∂L/∂d, ∂L/∂a, ∂L/∂b, ∂L/∂z1, ∂L/∂z2.
Consider the following training dataset with input X = (x1, x2) and target (desired) output d. A neuron with two inputs and one output is trained on this dataset. The activation function is linear, with zero bias. Sum of squared errors is used as the loss function.
x1   x2   d
1    0    1.0
0    1    1.0
1    1    0.0
A. If back propagation is used, what will be the weights (w1, w2) after convergence?
L = (w1 - 1)^2 + (w2 - 1)^2 + (w1 + w2)^2 = 2w1^2 + 2w2^2 + 2w1w2 - 2w1 - 2w2 + 2.
Components of the Hessian H: h11 = h22 = 4, h12 = h21 = 2. Thus H is positive definite, so the loss has a unique (local and global) minimum. Setting the gradient to zero gives 4w1 + 2w2 = 2 and 2w1 + 4w2 = 2, so backpropagation converges to the optimal weights (w1, w2) = (1/3, 1/3).
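A short numerical check of this result (a sketch that just re-derives the same numbers with NumPy):

import numpy as np

# L(w) = (w1 - 1)^2 + (w2 - 1)^2 + (w1 + w2)^2
# gradient = H @ w - b with H = [[4, 2], [2, 4]] and b = [2, 2]
H = np.array([[4.0, 2.0], [2.0, 4.0]])
b = np.array([2.0, 2.0])

print(np.linalg.eigvalsh(H))   # [2. 6.] -> H is positive definite
print(np.linalg.solve(H, b))   # [0.3333 0.3333] -> the unique minimum (1/3, 1/3)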
B. What will be the nature of the loss function? What value of the learning rate leads to convergence in the least number of iterations? Show all calculation steps.
The loss function is quadratic; because of the cross term 2w1w2, its contours are not aligned with either the w1 or the w2 axis. The optimal learning rate is 1/(largest eigenvalue of H). The eigenvalues of H are 2 and 6, so the learning rate giving fastest convergence is 1/6.
C. To achieve convergence in the least number of iterations, will you use batch gradient descent, stochastic gradient descent, or mini-batch gradient descent, and why?
Batch gradient descent converges in the least number of iterations here: the loss is convex (a quadratic with a positive definite Hessian), and the full-batch gradient is the exact gradient, whereas stochastic and mini-batch updates follow noisy gradient estimates.
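To illustrate parts B and C together, a minimal batch gradient descent run on this dataset is sketched below, using the learning rate 1/6 from part B (the zero initialization is an arbitrary choice):

import numpy as np

# Training set from the table above: rows are (x1, x2), targets d.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
d = np.array([1.0, 1.0, 0.0])

w = np.zeros(2)
lr = 1.0 / 6.0                 # 1 / (largest eigenvalue of the Hessian)
for _ in range(100):
    err = X @ w - d            # linear neuron with zero bias
    grad = 2 * X.T @ err       # gradient of the sum-of-squares loss
    w -= lr * grad

print(w)                       # converges to approximately [1/3, 1/3]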
B. The performance of three optimization algorithms, momentum-based gradient descent, RMSProp, and Adam, is shown in different colors on the following loss-landscape graph. Which color (green / purple / pink) represents which algorithm, and why?
Momentum (green):
Momentum keeps overshooting, because the accumulated velocity carries the update past the minimum before it turns back.
In the given loss landscape, momentum reaches the lowest cost in the fewest iterations.
Momentum usually speeds up learning while requiring only a very minor implementation change over plain gradient descent.
RMSProp (purple):
In the given loss landscape the trajectory is smoother, because each update is normalized by a running average of the squared gradients.
RMSProp moves faster than Adam in the initial iterations, since it uses only the squared-gradient (second-moment) term and no momentum term.
In the later iterations the cost improves only a little.
RMSProp maintains per-parameter learning rates.
Its adaptive learning rate usually prevents the effective step size from decaying too quickly or too slowly.
Adam (pink):
Adam performs a form of learning-rate annealing with adaptive step sizes.
The trajectory is smooth but comparatively slow throughout the iterations.
In the later iterations the optimization works noticeably better than RMSProp.
Adam's hyperparameters (learning rate, exponential decay rates for the moment estimates, etc.) usually work well at their default values and rarely need to be tuned.
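For reference, a minimal NumPy sketch of the three update rules compared above (textbook formulations; the hyperparameter defaults shown are illustrative assumptions, not values from the question):

import numpy as np

def momentum_step(w, grad, v, lr=0.01, beta=0.9):
    # Momentum: accumulate a velocity and step along it (overshoots, then corrects).
    v = beta * v + grad
    return w - lr * v, v

def rmsprop_step(w, grad, s, lr=0.001, rho=0.9, eps=1e-8):
    # RMSprop: per-parameter step size via a running average of squared gradients.
    s = rho * s + (1 - rho) * grad ** 2
    return w - lr * grad / (np.sqrt(s) + eps), s

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: momentum (first moment) plus RMSprop-style scaling (second moment),
    # with bias correction for the early iterations (t starts at 1).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v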