Solution Dseclzg524!01!102020 Ec2r
SOLUTION
Consider the following DNN for image classification on a dataset of 32x32 RGB images.
model = models.Sequential()
# Layer 1
model.add(layers.Dense(50, activation='relu', input_shape=**A**))
# Layer 2
model.add(layers.Dense(40, activation='relu'))
# Layer 3
model.add(layers.Dense(30, activation='relu'))
# Layer 4
model.add(layers.Dense(**B**, activation=**C**))
model.compile(optimizer='sgd', loss=**D**, metrics=['accuracy'])
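A completed version of the model is sketched below. The values filled in for A-D are assumptions consistent with the stated 32x32 RGB input (flattened to 32*32*3 = 3072 features for the Dense layer); the number of classes (10 here), the softmax output activation, and the categorical cross-entropy loss are illustrative choices, since the question does not state the number of classes.

from tensorflow.keras import layers, models

model = models.Sequential()
# Layer 1: A = (3072,) because a Dense layer needs the 32x32x3 image flattened
model.add(layers.Dense(50, activation='relu', input_shape=(32 * 32 * 3,)))
# Layer 2
model.add(layers.Dense(40, activation='relu'))
# Layer 3
model.add(layers.Dense(30, activation='relu'))
# Layer 4: B = assumed 10 classes, C = 'softmax' for multi-class output
model.add(layers.Dense(10, activation='softmax'))
# D = 'categorical_crossentropy' (with one-hot encoded labels)
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])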
D. What is the difference between kernel regularizers and activity regularizers?
kernel_regularizer: Regularizer to apply a penalty on the layer's kernel.
activity_regularizer: Regularizer to apply a penalty on the layer's output.
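As a short illustration, both regularizers attach to the same Dense layer but penalize different quantities (the l2/l1 factors below are arbitrary example values):

from tensorflow.keras import layers, regularizers

dense = layers.Dense(
    40,
    activation='relu',
    # penalty on the weight matrix (kernel): adds 1e-4 * sum(W**2) to the loss
    kernel_regularizer=regularizers.l2(1e-4),
    # penalty on the layer's output activations: adds 1e-5 * sum(|outputs|) to the loss
    activity_regularizer=regularizers.l1(1e-5),
)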
A. Calculate the actual output (y1, y2) for the current iteration with input (x1, x2) = (1, 1).
y1 = 1/(1+e), y2 = e/(1+e)
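As a numeric check, these values (y1 ≈ 0.269, y2 ≈ 0.731) are what a softmax over logits (z1, z2) = (0, 1), or equivalently sigmoid outputs at -1 and +1, would produce; the logit values used here are an assumption for illustration:

import numpy as np

def softmax(z):
    # Numerically stable softmax.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Assumed logits (z1, z2) = (0, 1) for illustration only.
print(softmax(np.array([0.0, 1.0])))   # [0.2689 0.7311] == [1/(1+e), e/(1+e)]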
B. With the (z1, z2) calculated in A., use the following computation graph to calculate the loss L and the partial derivatives ∂L/∂c, ∂L/∂d, ∂L/∂a, ∂L/∂b, ∂L/∂z1, ∂L/∂z2.
Consider the following training dataset with input X = (x1, x2) and target (desired) output d. A neuron with two inputs and one output is trained on this dataset. The activation function is linear, with zero bias. Sum of squared errors is used as the loss function.
x1   x2   d
1    0    1.0
0    1    1.0
1    1    0.0
A. If back propagation is used, what will be the weights (w1, w2) after convergence?
L = (w1 - 1)^2 + (w2 - 1)^2 + (w1 + w2)^2 = 2w1^2 + 2w2^2 + 2w1w2 - 2w1 - 2w2 + 2.
Components of the Hessian H: h11 = h22 = 4, h12 = h21 = 2. Thus H is positive definite, so the loss has a unique (local and global) minimum. Setting the gradient to zero gives 4w1 + 2w2 = 2 and 2w1 + 4w2 = 2, so backpropagation converges to the optimal weights (w1, w2) = (1/3, 1/3).
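A short numerical check of this result (a sketch that just re-derives the same numbers with NumPy):

import numpy as np

# L(w) = (w1 - 1)^2 + (w2 - 1)^2 + (w1 + w2)^2
# gradient = H @ w - b with H = [[4, 2], [2, 4]] and b = [2, 2]
H = np.array([[4.0, 2.0], [2.0, 4.0]])
b = np.array([2.0, 2.0])

print(np.linalg.eigvalsh(H))   # [2. 6.] -> H is positive definite
print(np.linalg.solve(H, b))   # [0.3333 0.3333] -> the unique minimum (1/3, 1/3)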
B. What will be the nature of the loss function? What value of the learning rate leads to convergence in the least number of iterations? Show all calculation steps.
The loss function is quadratic; because of the cross term 2w1w2, its contours are not aligned with either the w1 or the w2 axis. The optimal learning rate is 1/(largest eigenvalue of H). The eigenvalues of H are 2 and 6, so the learning rate giving fastest convergence is 1/6.
C. To achieve convergence in the least number of iterations, will you use batch gradient descent, stochastic gradient descent, or mini-batch gradient descent, and why?
Batch gradient descent converges in the least number of iterations here: the loss is convex (a quadratic with a positive definite Hessian), and the full-batch gradient is the exact gradient, whereas stochastic and mini-batch updates follow noisy gradient estimates.
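To illustrate parts B and C together, a minimal batch gradient descent run on this dataset is sketched below, using the learning rate 1/6 from part B (the zero initialization is an arbitrary choice):

import numpy as np

# Training set from the table above: rows are (x1, x2), targets d.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
d = np.array([1.0, 1.0, 0.0])

w = np.zeros(2)
lr = 1.0 / 6.0                 # 1 / (largest eigenvalue of the Hessian)
for _ in range(100):
    err = X @ w - d            # linear neuron with zero bias
    grad = 2 * X.T @ err       # gradient of the sum-of-squares loss
    w -= lr * grad

print(w)                       # converges to approximately [1/3, 1/3]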
B. The performance of three optimization algorithms, momentum-based gradient descent, RMSProp, and Adam, is shown in different colors on the following loss-landscape graph. Which color (green / purple / pink) represents which algorithm, and why?
Momentum (green):
Momentum keeps overshooting, because the accumulated velocity carries the update past the minimum before it turns back.
In the given loss landscape, momentum reaches the lowest cost in the fewest iterations.
Momentum usually speeds up learning while requiring only a very minor implementation change over plain gradient descent.
RMSProp (purple):
In the given loss landscape the trajectory is smoother, because each update is normalized by a running average of the squared gradients.
RMSProp moves faster than Adam in the initial iterations, since it uses only the squared-gradient (second-moment) term and no momentum term.
In the later iterations the cost improves only a little.
RMSProp maintains per-parameter learning rates.
Its adaptive learning rate usually prevents the effective step size from decaying too quickly or too slowly.
Adam (pink):
Adam performs a form of learning-rate annealing with adaptive step sizes.
The trajectory is smooth but comparatively slow throughout the iterations.
In the later iterations the optimization works noticeably better than RMSProp.
Adam's hyperparameters (learning rate, exponential decay rates for the moment estimates, etc.) usually work well at their default values and rarely need to be tuned.
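For reference, a minimal NumPy sketch of the three update rules compared above (textbook formulations; the hyperparameter defaults shown are illustrative assumptions, not values from the question):

import numpy as np

def momentum_step(w, grad, v, lr=0.01, beta=0.9):
    # Momentum: accumulate a velocity and step along it (overshoots, then corrects).
    v = beta * v + grad
    return w - lr * v, v

def rmsprop_step(w, grad, s, lr=0.001, rho=0.9, eps=1e-8):
    # RMSprop: per-parameter step size via a running average of squared gradients.
    s = rho * s + (1 - rho) * grad ** 2
    return w - lr * grad / (np.sqrt(s) + eps), s

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: momentum (first moment) plus RMSprop-style scaling (second moment),
    # with bias correction for the early iterations (t starts at 1).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v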