Lesson 5 Deep Neural Net Optimization Tuning Interpretability
TensorFlow
Deep Neural Net Optimization, Tuning,
and Interpretability
Learning Objectives
[Figure: two optimization examples. Warehouse placement: find the optimum warehouse location to minimize shipment. Bridge construction: optimize the design to carry maximum load.]
Optimization Algorithms
Algorithms that are used to solve optimization problems are called optimization algorithms. In deep
learning, optimization algorithms are used to optimize the cost function J.
The value of the cost function J is the mean of the loss L between the predicted value y’ and the actual value y.
The value y’ is obtained during the forward propagation step and makes use of the weights W and
biases b of the network.
With the help of optimization algorithms, we minimize the value of the cost function J by updating the
values of the trainable parameters W and b.
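The relationship between the loss L and the cost J can be sketched in a few lines. This is only an illustration: the single-weight linear model, the squared-error loss, and the sample data are assumptions, not part of the lesson.

```python
def predict(x, w, b):
    """Forward propagation for an assumed single-weight linear model: y' = w*x + b."""
    return w * x + b


def cost(xs, ys, w, b):
    """Cost J: the mean of the per-example loss L (squared error is assumed here)."""
    losses = [(predict(x, w, b) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(losses) / len(losses)


xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]  # illustrative targets generated by y = 2x + 1

j_poor = cost(xs, ys, w=0.0, b=0.0)  # poor parameters give a high cost
j_good = cost(xs, ys, w=2.0, b=1.0)  # the generating parameters give zero cost
print(j_poor, j_good)
```

An optimizer's job is to move W and b from values like the first pair toward values like the second.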
Types of Optimizers
GD
SGD
Gradient descent (GD) is used to minimize the cost function J and obtain the optimal weight W and bias b.
The change in the values of W and b is determined by the learning rate and the derivatives of J with respect to W and b. This
process is repeated until J has been minimized.
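A minimal sketch of this update loop follows. The linear model, the mean squared error cost, the learning rate of 0.05, and the step count are all assumptions chosen for illustration.

```python
def gradients(xs, ys, w, b):
    """Partial derivatives of the mean squared error cost J with respect to w and b."""
    n = len(xs)
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    dj_dw = sum(2 * e * x for e, x in zip(errors, xs)) / n
    dj_db = sum(2 * e for e in errors) / n
    return dj_dw, dj_db


def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Repeatedly move w and b against the gradient of J."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        dj_dw, dj_db = gradients(xs, ys, w, b)
        # A negative derivative increases the parameter; a positive one decreases it.
        w -= learning_rate * dj_dw
        b -= learning_rate * dj_db
    return w, b


w_opt, b_opt = gradient_descent([1.0, 2.0, 3.0], [3.0, 5.0, 7.0])
print(round(w_opt, 3), round(b_opt, 3))
```

Because every step uses all training examples, each update is smooth but costs a full pass over the data.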
Gradient Descent
If the slope (the partial derivative of J with respect to W) is negative at a point, then W is increased to move toward
the global minimum.
[Figure: cost plotted against weight, showing a point with negative slope and the global minimum]
If the slope (the partial derivative of J with respect to W) is positive at a point, then W is decreased to move toward
the global minimum.
[Figure: cost plotted against weight, showing a point with positive slope and the global minimum]
Stochastic Gradient Descent (SGD)
The mathematical formulation of the weight update for SGD is the same as for GD, but the data points are
shuffled before they are used for optimization.
Because randomly ordered data points go into the optimizer, the resulting weights are
noisy.
Stochastic Gradient Descent-Mini Batch (SGD-Mini Batch)
SGD-Mini Batch is a combination of vanilla GD and SGD that distributes the whole training data into small mini-batches.
It divides the training data into small batches so that the network can be trained on the data more easily.
The mathematical formulation is the same as for vanilla GD, but the training occurs batch-wise.
For example, suppose a training set has 400 training examples that are divided into 10 batches, with each batch
containing 40 training examples. The weight update equation is then iterated 10 times
(the number of batches).
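The 400-example case above can be sketched as follows. The helper name `make_mini_batches` and the fixed shuffle seed are hypothetical choices for illustration.

```python
import random


def make_mini_batches(examples, batch_size, seed=0):
    """Shuffle the examples, then slice them into consecutive mini-batches."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed only to make the sketch repeatable
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]


training_set = list(range(400))  # stand-in for 400 training examples
batches = make_mini_batches(training_set, batch_size=40)

# One weight update per batch gives 10 updates for each pass over the data.
updates_per_pass = len(batches)
print(updates_per_pass, len(batches[0]))
```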
GD vs. SGD-Mini Batch
GD is computationally expensive, but it converges to the global minimum smoothly. In SGD, on the other hand,
the weight updates are noisier, so it takes more time to reach the global minimum.
[Figure: cost plotted against weight for GD and for SGD, with the noisy weights of SGD marked]
SGD with Momentum
SGD with momentum, or simply momentum, is an advanced optimization algorithm that uses a moving
average of the gradients to update the trainable parameters.
SGD with momentum is well suited to overcoming the noisy weights of SGD.
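The smoothing effect of the moving average can be sketched on a toy problem. Everything here is an assumption for illustration: the cost J(w) = (w - 3)^2, the alternating +1/-1 noise standing in for mini-batch noise, and the values of the learning rate and `beta`.

```python
def noisy_gradient(w, step):
    """Gradient of J(w) = (w - 3)^2 plus alternating +1/-1 noise."""
    noise = 1.0 if step % 2 == 0 else -1.0
    return 2.0 * (w - 3.0) + noise


def run(use_momentum, steps=300, learning_rate=0.1, beta=0.9):
    """Return the sequence of weights visited, with or without momentum."""
    w, velocity, history = 0.0, 0.0, []
    for step in range(steps):
        grad = noisy_gradient(w, step)
        if use_momentum:
            # Exponential moving average of past gradients.
            velocity = beta * velocity + (1.0 - beta) * grad
            w -= learning_rate * velocity
        else:
            w -= learning_rate * grad
        history.append(w)
    return history


def spread(values):
    """Size of the band the weights oscillate in."""
    return max(values) - min(values)


plain = run(use_momentum=False)
smooth = run(use_momentum=True)
plain_spread = spread(plain[-20:])
smooth_spread = spread(smooth[-20:])
print(plain_spread, smooth_spread)
```

In this toy setting both runs settle near w = 3, but the weights oscillate in a much narrower band when the moving average is used.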
SGD with and without Momentum
Comparing SGD with and without momentum shows that momentum makes the steps smoother and less noisy.
In Nesterov Accelerated Gradient (NAG), interim parameters are evaluated to check whether the velocity update would lead to a bad loss.
In NAG, an interim weight is first calculated from the current weight and the velocity, and it is then used to calculate the updated weight with the help
of a velocity factor.
The difference between the momentum method and NAG is in the gradient computation phase.
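The difference in the gradient computation phase can be sketched side by side. This uses the common velocity formulation with learning rate `eta` and decay rate `alpha`; the toy cost J(w) = w^2 and the parameter values are assumptions for illustration.

```python
def grad(w):
    """Gradient of the assumed toy cost J(w) = w^2."""
    return 2.0 * w


def momentum_step(w, v, eta, alpha):
    """Classical momentum: the gradient is taken at the current weight."""
    v = alpha * v - eta * grad(w)
    return w + v, v


def nag_step(w, v, eta, alpha):
    """NAG: the gradient is taken at an interim (look-ahead) weight."""
    w_interim = w + alpha * v
    v = alpha * v - eta * grad(w_interim)
    return w + v, v


def run(step_fn, steps, eta=0.1, alpha=0.9):
    w, v = 5.0, 0.0
    for _ in range(steps):
        w, v = step_fn(w, v, eta, alpha)
    return w


w_momentum_2 = run(momentum_step, steps=2)
w_nag_2 = run(nag_step, steps=2)
w_momentum = run(momentum_step, steps=200)
w_nag = run(nag_step, steps=200)
print(w_momentum_2, w_nag_2, w_momentum, w_nag)
```

The two methods agree on the first step (the velocity is still zero) and diverge from the second step onward, because only NAG looks ahead before computing the gradient.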
NAG vs. SGD with Momentum
The two methods give distinct outputs when the learning rate η is reasonably large. In such a case, NAG allows
a larger decay rate α than SGD with momentum while preventing oscillations. It can also be
shown that SGD with momentum and NAG become equivalent when η is small.
Learning Rate of GD, SGD, and SGD-Mini Batch
Constant learning rate
Adaptive Gradient (AdaGrad)
[Equation: AdaGrad weight update, annotated with the initial learning rate and a small nonzero value]
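As a hedged sketch of how an initial learning rate and a small nonzero value typically enter the AdaGrad update, the following uses the commonly cited form of the rule. It is not taken from the lesson: the toy cost J(w) = w^2 and the names `initial_lr` and `epsilon` are assumptions.

```python
import math


def adagrad(grad_fn, w, initial_lr=1.0, epsilon=1e-8, steps=20):
    """Scale each step by the root of the accumulated squared gradients."""
    accumulated = 0.0
    effective_rates = []
    for _ in range(steps):
        g = grad_fn(w)
        accumulated += g * g
        # epsilon is the small nonzero value that keeps the division safe.
        rate = initial_lr / (math.sqrt(accumulated) + epsilon)
        effective_rates.append(rate)
        w -= rate * g
    return w, effective_rates


w_final, rates = adagrad(lambda w: 2.0 * w, w=5.0)
print(w_final, rates[0], rates[-1])
```

The effective learning rate shrinks as squared gradients accumulate, which is the sense in which AdaGrad adapts the learning rate along with the weights.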
Root Mean Square Propagation (RMSprop)
Adam
To introduce momentum
Batch Normalization
Data Preprocessing
[Figure: normalization and standardization, with values in the range 1 to 1000 rescaled to the range 0 to 1]
Why Data Preprocessing?
Data points can take very high or very low values. These extremes cascade through the network. Therefore, data
preprocessing is needed.
When there are multiple features, each with a different range of data points, the unprocessed data creates
instability, which cascades through the neural network layers. Scaling the different ranges to a standard range
leads to stability and better results.
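Two common ways of scaling a feature to a standard range can be sketched as follows. The sample values are made up, and the min-max and z-score formulas are the usual textbook definitions rather than anything specific to this lesson.

```python
import statistics


def normalize(values):
    """Min-max normalization: rescale values into the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: shift to zero mean and scale to unit standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


raw = [1.0, 250.0, 500.0, 1000.0]  # illustrative feature spanning 1 to 1000
normalized = normalize(raw)
standardized = standardize(raw)
print(normalized)
print(standardized)
```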
The weights of the neural network are updated during training, in each epoch. Suppose the weight
assigned to a neuron suddenly becomes large. It then cascades through all the layers, which causes
further instability.
Normalizing the data before feeding it into the network is not enough. The outputs of the neurons should
also be normalized. This is where batch normalization comes into the picture.
Batch Normalization
⮚ m and s are the mean and standard deviation computed for each batch, and g and b are trainable scale and shift values that are
optimized in the training process.
⮚ Large weights are no longer a concern, because normalization is applied to every layer’s output per batch, which
is why it is called batch normalization.
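A minimal sketch of the forward computation for one batch follows. It assumes m and s are the batch mean and standard deviation and that g and b are the learned scale and shift; the `epsilon` term and the sample activations are illustrative additions. In a real network g and b would be updated by the optimizer rather than passed in by hand.

```python
import math


def batch_norm(batch, g=1.0, b=0.0, epsilon=1e-5):
    """Normalize one batch of activations, then scale by g and shift by b."""
    m = sum(batch) / len(batch)                           # batch mean
    variance = sum((x - m) ** 2 for x in batch) / len(batch)
    s = math.sqrt(variance + epsilon)                     # batch standard deviation
    return [g * ((x - m) / s) + b for x in batch]


activations = [10.0, 200.0, 3000.0, 40.0]  # one unusually large output
normalized = batch_norm(activations)
shifted = batch_norm(activations, g=2.0, b=5.0)
print(normalized)
print(shifted)
```

After normalization the outlier no longer dominates: all four values fall within a couple of units of zero.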
Implementing Batch Normalization Using Keras
Problem Statement: Batch normalization allows each layer of a network to learn a
little more independently of the other layers. You are supposed to improve the model
performance by implementing the batch normalization technique.
Objective: Build an MLP model to demonstrate the effect of batch normalization using Keras.
Access: Click the Practice Labs tab on the left panel. Now, click on the START LAB button and
wait while the lab prepares itself. Then, click on the LAUNCH LAB button. A full-fledged Jupyter
lab opens, which you can use for your hands-on practice and projects.
Vanishing Gradient
What Is Gradient?
Gradient
When the gradient becomes very small, subtracting it from the weight barely changes the weight.
Therefore, the model stops learning. This problem of neural networks is called the vanishing gradient.
[Figure: curve plotted against x, with a change in x marked]
Why Does Vanishing Gradient Occur?
Activation functions like Sigmoid or Tanh squash their output into a very small numerical range.
For example, Sigmoid maps its output into the range 0 to 1. As a result, there are large regions of the
input space where even a large change in the input produces only a small change in the output.
Therefore, the gradient becomes small, and the vanishing gradient occurs.
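The size of the Sigmoid gradient can be checked directly. This sketch uses the standard derivative sigmoid(x) * (1 - sigmoid(x)); the choice of ten layers is an arbitrary illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_gradient(x):
    """Derivative of the Sigmoid: s * (1 - s), which is largest at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


peak = sigmoid_gradient(0.0)        # the largest value the gradient can take
saturated = sigmoid_gradient(10.0)  # a large input lands in the flat region
ten_layers = peak ** 10             # best case after chaining ten Sigmoid layers
print(peak, saturated, ten_layers)
```

Even in the best case the gradient is multiplied by at most 0.25 per Sigmoid layer, so it shrinks quickly as it is propagated back through a deep network.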
How to Prevent Vanishing Gradient?
The vanishing gradient problem can be avoided by using an activation function that does not
squash its input into a very small range. A popular choice is the rectified linear unit (ReLU), which maps x to max(0, x).
Switching from CPUs to GPUs, with their faster computation, has made the standard backpropagation method
feasible at a much lower cost.
Using a residual network can avoid the vanishing gradient problem by grouping many short neural networks
together.
Exploding Gradient
What Is Exploding Gradient?
Exploding gradients are a problem in which large error gradients accumulate and result in very
large updates to the neural network model weights during training.
How to Fix Exploding Gradients?
The exploding gradient problem can be fixed by redesigning the neural network to have fewer layers and by using smaller mini-batch
sizes.
Gradient clipping is a method by which the size of the gradient can be limited. It is an effective way to fix exploding
gradients.
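One common form of gradient clipping rescales the whole gradient vector when its norm exceeds a threshold. The following is a minimal sketch of that idea; the function name `clip_by_norm` and the threshold value are illustrative.

```python
import math


def clip_by_norm(gradients, max_norm):
    """Rescale the gradient vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)  # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in gradients]


clipped = clip_by_norm([30.0, 40.0], max_norm=5.0)   # norm 50 is scaled down to 5
unchanged = clip_by_norm([0.3, 0.4], max_norm=5.0)   # norm 0.5 is left alone
print(clipped, unchanged)
```

Clipping by norm keeps the direction of the update and limits only its size.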
Hyperparameter Tuning
What Is a Parameter?
Parameters are found while training the model. For example, in K-means clustering, the positions of the
centroids are model parameters.
What Is a Hyperparameter?
Hyperparameters are set before training. A classic example of a hyperparameter is the value
of K in K-means clustering, which is decided before creating the model.
Hyperparameters of Deep Learning Model
Learning Rate: The most important hyperparameter that helps the model to get an optimized result
Number of Hidden Units: A classic hyperparameter that specifies the representational capacity of a model
Convolutional Kernel Width: It influences the capacity of a model by influencing the number of model parameters in a convolutional neural network
Mini-Batch Size: It affects the training process, training speed, and number of iterations in a deep learning model
Selection of Hyperparameters
Hyperparameters can be selected manually or automatically.
Technically, both selection approaches are viable. Real-world hyperparameter optimization is a
combination of the two.
Manual Tuning
Changing one hyperparameter at a time by hand, after each evaluation of the neural network, is
called manual tuning.
Manual Tuning Approach
From the example, it can be concluded that manual tuning is not an efficient approach, as even a fivefold
increase in the number of neurons resulted in only a 4% increase in accuracy.
The automatic selection approach is preferred over the manual approach, as the latter is very laborious.
The automatic approach is the process of tuning the hyperparameters with the help of algorithms.
They are as follows:
Grid Search
Random Search
Gradient-Based Tuning
Evolutionary Optimization
Bayesian Optimization
Grid Search
Grid search iterates over every combination of the given hyperparameter values and evaluates each one using cross-validation.
[Figure: grid search sampling pattern over an important parameter and an unimportant parameter]
Eight different hyperparameters are given, and grid search tries different hyperparameter values.
Four models are created with the available hyperparameters.
[Figure: hyperparameters 1 to 8 pass through grid search to produce Model 1, Model 2, Model 3, and Model 4]
The model with the lowest error will be selected as the most efficient model and the hyperparameters
used in the model are finalized.
To select the hyperparameters, the given data is divided into three different parts.
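The grid search procedure can be sketched as follows. The two hyperparameters, their candidate values, and the `validation_error` function are hypothetical stand-ins: in practice that function would train a model and measure its error on held-out data.

```python
from itertools import product


def validation_error(learning_rate, hidden_units):
    """Hypothetical stand-in for training a model and measuring its error."""
    return abs(learning_rate - 0.01) * 10 + abs(hidden_units - 64) / 64


# Two candidate values per hyperparameter give four models in total.
grid = {"learning_rate": [0.1, 0.01], "hidden_units": [32, 64]}

results = {
    combo: validation_error(*combo)
    for combo in product(grid["learning_rate"], grid["hidden_units"])
}

# The combination with the lowest error is selected.
best = min(results, key=results.get)
print(best, results[best])
```

The number of models grows multiplicatively with each added hyperparameter, which is the main cost of grid search.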
Random Search
Random search is an optimization method that can be used on functions that are not differentiable or continuous.
[Figure: random search sampling pattern over an important parameter and an unimportant parameter]
Random search saves time.
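Random search can be sketched in the same style. The search ranges, the seed, and the `validation_error` function are hypothetical; the function stands in for training a model and measuring its error.

```python
import random


def validation_error(learning_rate, hidden_units):
    """Hypothetical stand-in for training a model and measuring its error."""
    return abs(learning_rate - 0.01) * 10 + abs(hidden_units - 64) / 64


def random_search(trials, seed=0):
    """Sample hyperparameters at random and keep the lowest-error trial."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    results = []
    for _ in range(trials):
        learning_rate = 10 ** rng.uniform(-4, -1)  # sampled on a log scale
        hidden_units = rng.randint(16, 128)        # discrete, so not differentiable
        error = validation_error(learning_rate, hidden_units)
        results.append((error, learning_rate, hidden_units))
    return min(results), results


best_trial, all_trials = random_search(trials=10)
print(best_trial)
```

The number of trials is fixed in advance, so the cost does not grow with the number of hyperparameters the way a full grid does.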
Gradient-Based Tuning
Gradient-based tuning is used for algorithms where it is possible to compute the gradient
with respect to the hyperparameters. The hyperparameters are then optimized by gradient
descent.
Evolutionary Optimization
Bayesian Optimization
Finds optimal hyperparameters from the results of previously built models with different
hyperparameter configurations, using a Gaussian process
Has the inherent ability to study trends in a given data set that a human could not
detect
Interpretability
What Is Interpretability?
Interpretability is the degree to which a human can consistently predict the model’s result.
Importance of Interpretability
Reliability: To ensure that small changes in the data do not lead to big changes in the prediction
Trust: To make the model easier for humans to trust, because it explains its decisions
When Is Interpretability Not Needed?
Model-Specific or Model-Agnostic
Intrinsic or Post Hoc
Algorithm Transparency
Local Interpretability for a Group of Predictions
Evaluation of Interpretability
Doshi-Velez and Kim (2017) propose three main levels for the evaluation of interpretability:
Application-Level Evaluation (Real Task)
Properties of Explanation Methods
Portability: Describes the range of models for which the explanation method is suitable
Algorithmic Complexity: To check that all the relationships picked up are causal
Properties of Individual Explanations
Consistency: Helps to differentiate among models trained on the same data set with the same procedure
a. Gradient Descent
The stochastic gradient descent with momentum optimization algorithm uses a moving average.
Knowledge Check 2
Which of the following optimization functions update the learning rate along with the weights?
a. Gradient Descent
d. AdaGrad
Knowledge Check 3
Which of the following are the most widespread optimizers in deep learning?
a. Adam
b. AdaGrad
c. AdaDelta
d. RMSProp
a. Gradient Exploding
b. Gradient Cliffing
c. Gradient Converging
Gradient exploding is the sudden change in the weights of a neural network when large error gradients accumulate.
Hyperparameter Tuning with MNIST Data Set
Access: Click the Practice Labs tab on the left panel. Now, click on the
START LAB button and wait while the lab prepares itself. Then, click on the
LAUNCH LAB button. A full-fledged Jupyter lab opens, which you can use
for your hands-on practice and projects.
Thank You