Module 2
Deep Learning
Introduction to deep learning, Deep feed forward network, Training deep models, Optimization
techniques - Gradient Descent (GD), GD with momentum, Nesterov accelerated GD, Stochastic
GD, AdaGrad, RMSProp, Adam. Regularization Techniques - L1 and L2 regularization, Early
stopping, Dataset augmentation, Parameter sharing and tying, Injecting noise at input,
Ensemble methods, Dropout, Parameter initialization.
Steps for Training Deep Models
1. Data Collection
Gather a large, diverse dataset suitable for the task (images, text, etc.).
2. Data Preprocessing
Clean the data (remove noise, handle missing values).
Normalize or scale the data to ensure it is in a consistent range.
3. Model Design
Choose an appropriate architecture (e.g., CNN for images, RNN for sequences).
Decide the number of layers, neurons, and activation functions.
4. Split Data
Divide the data into training, validation, and test sets.
5. Initialization
Initialize the model weights properly to avoid vanishing or exploding gradients.
6. Forward Pass
Input data through the network, layer by layer, to calculate the output.
7. Loss Calculation
Compute the loss (difference between predicted output and actual output) using a loss
function.
8. Backward Pass (Backpropagation)
Compute the gradients of the loss with respect to the weights and update the parameters using an optimizer.
9. Model Evaluation
Evaluate model performance on the validation set after each epoch to tune
hyperparameters and avoid overfitting.
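As an illustration of how these steps fit together, here is a minimal NumPy sketch of the full pipeline (not taken from the slides; the synthetic dataset, the single-hidden-layer architecture and every hyperparameter are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

# Steps 1-2: collect and preprocess data (synthetic binary classification, standardized features)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 4: split into training and validation sets
X_train, X_val = X[:800], X[800:]
y_train, y_val = y[:800], y[800:]

# Steps 3 and 5: design a one-hidden-layer network and initialize it with small random weights
W1 = rng.normal(scale=0.1, size=(2, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=(16, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for epoch in range(50):
    # Step 6: forward pass
    h = np.tanh(X_train @ W1 + b1)
    p = sigmoid(h @ W2 + b2)

    # Step 7: loss calculation (binary cross-entropy)
    loss = -np.mean(y_train * np.log(p + 1e-9) + (1 - y_train) * np.log(1 - p + 1e-9))

    # Step 8: backward pass (backpropagation) and gradient-descent parameter update
    dz2 = (p - y_train) / len(X_train)
    dW2 = h.T @ dz2; db2 = dz2.sum(axis=0)
    dh = (dz2 @ W2.T) * (1 - h ** 2)
    dW1 = X_train.T @ dh; db1 = dh.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

    # Step 9: evaluate on the validation set after each epoch
    val_p = sigmoid(np.tanh(X_val @ W1 + b1) @ W2 + b2)
    val_acc = np.mean((val_p > 0.5) == y_val)

print(f"final training loss {loss:.3f}, validation accuracy {val_acc:.2f}")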
Optimization Techniques in Deep Learning
• Optimization techniques in deep learning refer to mathematical and algorithmic
methods used to minimize (or maximize) an objective function, typically the loss
function.
• The goal of optimization is to adjust the model's parameters (weights and biases)
to improve its performance on a given task.
• Gradient-Based Optimization techniques rely on computing gradients of the loss
function to update the model parameters.
• Gradient Descent (GD)
• Batch Gradient Descent (BGD)
• Mini-Batch Gradient Descent
• Stochastic Gradient Descent (SGD)
• Variants of SGD (momentum-based and adaptive optimizers)
• Momentum
• Nesterov Accelerated Gradient (NAG)
• Adagrad
• RMSprop
• Adam (Adaptive Moment Estimation)
Gradient Descent
• Gradient Descent is an iterative optimization algorithm that minimizes a loss function by repeatedly stepping in the direction of the negative gradient.
• Graphically, this means finding the lowest point on the function curve.
• The gradient (slope) is positive on the right side of a minimum and negative on
the left side.
• The slope is close to zero at the minimum, indicating a critical point.
• The Gradient Descent (GD) algorithm starts from a random point and moves
downhill to reach the minimum.
• The gradient represents the direction and rate of steepest ascent; we move in
the opposite direction to minimize the function.
• The learning rate controls the step size—if too high, the algorithm may
overshoot; if too low, convergence is slow.
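As a minimal sketch of the update rule w := w - eta * f'(w) on a simple one-dimensional function (the function, starting point and learning rate below are illustrative choices, not from the slides):

def f(w):
    # objective: a simple convex bowl with its minimum at w = 3
    return (w - 3.0) ** 2

def grad_f(w):
    # analytic gradient (slope) of f
    return 2.0 * (w - 3.0)

w = -4.0      # arbitrary starting point
eta = 0.1     # learning rate: too large overshoots, too small converges slowly

for step in range(50):
    w = w - eta * grad_f(w)   # step against the gradient (direction of steepest descent)

print(f"w after 50 steps: {w:.4f}, f(w) = {f(w):.6f}")   # w ends up close to 3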
1. Batch Gradient Descent (BGD)
• Uses the entire dataset to compute the gradient of the loss function.
• Updates model parameters only after evaluating all training samples.
Advantages:
• Produces a stable and smooth convergence.
• Moves steadily toward the optimal solution.
Disadvantages:
• Computationally expensive for large datasets.
• Requires significant memory and processing power.
• Slower updates since it waits for the entire dataset before making a move.
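A small sketch of batch gradient descent for linear regression, where every update uses the gradient averaged over the entire (synthetic) dataset; the data and hyperparameters are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))                      # the full training set
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)

w = np.zeros(3)
eta = 0.1
for epoch in range(200):
    grad = X.T @ (X @ w - y) / len(X)              # mean-squared-error gradient over ALL samples
    w -= eta * grad                                # one parameter update per full pass

print("learned weights:", np.round(w, 3))          # close to [2, -1, 0.5]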
2. Mini-Batch Gradient Descent
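The mini-batch variant computes the gradient on small, shuffled subsets of the data, so parameters are updated many times per epoch while each update stays far cheaper than a full-batch pass; it is the standard compromise between batch GD and SGD. A minimal sketch on the same kind of synthetic regression problem (the batch size and other settings are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)

w = np.zeros(3)
eta, batch_size = 0.1, 32                          # illustrative hyperparameters
for epoch in range(50):
    idx = rng.permutation(len(X))                  # reshuffle the data every epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]          # indices of one mini-batch
        grad = X[b].T @ (X[b] @ w - y[b]) / len(b) # gradient over the mini-batch only
        w -= eta * grad                            # update after every mini-batch

print("learned weights:", np.round(w, 3))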
3. Gradient Descent with Momentum
Gradient Descent with momentum is an optimization technique that helps
accelerate gradient descent by smoothing out updates and reducing oscillations.
Why Momentum?
• Standard Gradient Descent (GD) can be slow, especially when gradients oscillate
in different directions.
• Momentum helps GD move faster in the right direction by accumulating past
gradients and using them to update weights.
• This is especially useful in valleys or ravines where standard GD might zig-zag
slowly.
How Does Momentum Help?
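In one common formulation (a sketch, not reproduced from the slide): a velocity vector accumulates an exponentially weighted sum of past gradients, and the weights move along the velocity instead of the raw gradient, which damps oscillations across a ravine while building up speed along it. The elongated quadratic objective and the coefficients below are illustrative:

import numpy as np

def grad(w):
    # gradient of an elongated quadratic "ravine": flat along w[0], steep along w[1]
    return np.array([0.2 * w[0], 20.0 * w[1]])

w = np.array([10.0, 1.0])
v = np.zeros(2)                    # velocity (accumulated past gradients)
eta, beta = 0.05, 0.9              # learning rate and momentum coefficient

for step in range(200):
    v = beta * v + grad(w)         # v_t = beta * v_{t-1} + gradient
    w = w - eta * v                # w_t = w_{t-1} - eta * v_t

print("w after 200 steps:", np.round(w, 4))   # approaches the minimum at (0, 0)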
4. Stochastic Gradient Descent (SGD)
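As a minimal sketch under the usual formulation (not reproduced from the slides), SGD updates the parameters after every single, randomly ordered training example; the updates are noisy but extremely cheap. The synthetic regression data and step size are illustrative:

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)

w = np.zeros(3)
eta = 0.01
for epoch in range(20):
    for i in rng.permutation(len(X)):             # visit the samples in random order
        grad_i = (X[i] @ w - y[i]) * X[i]         # gradient from ONE sample only
        w -= eta * grad_i                         # noisy but very cheap update

print("learned weights:", np.round(w, 3))         # close to [2, -1, 0.5]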
5. Stochastic Gradient Descent (SGD) with
Momentum
• Momentum accelerates SGD by incorporating a fraction of the previous
update into the current update.
• Smooths out updates and prevents oscillations, especially in ravines (steep in
some directions, flat in others).
• Remembers past update direction, reducing reliance on only the current
gradient.
• Prevents zig-zagging, leading to faster convergence, especially in high-
dimensional spaces.
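For reference, one standard way of writing the two update rules (a common textbook formulation, not reproduced from a particular slide):

Without momentum (plain SGD, one sample or mini-batch i per step):
    w_{t+1} = w_t - \eta \nabla L_i(w_t)

With momentum (velocity v, momentum coefficient \beta, typically about 0.9):
    v_{t+1} = \beta v_t + \nabla L_i(w_t)
    w_{t+1} = w_t - \eta v_{t+1}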
• Each contour line represents a region of equal loss (or cost)—the closer you get to
the center (red dot), the lower the loss.
• The red dot represents the global minimum, where the optimization should ideally
converge.
6. Nesterov Accelerated Gradient (NAG)
• Standard momentum-based gradient descent updates parameters using a velocity
term to smooth out updates.
• NAG improves upon this by calculating the gradient at the "look-ahead" position
instead of the current position, reducing oscillations and overshooting issues.
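One common formulation of the NAG update (a standard textbook form; the equations on the original slides are not reproduced here) evaluates the gradient at the point the momentum step alone would reach:

    look-ahead point:  \tilde{w}_t = w_t - \eta \beta v_t
    velocity update:   v_{t+1} = \beta v_t + \nabla L(\tilde{w}_t)
    weight update:     w_{t+1} = w_t - \eta v_{t+1}

Because the correction uses the gradient at the look-ahead position, NAG can slow down before it overshoots a minimum, something plain momentum only notices one step later.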
7. Adaptive Gradient (AdaGrad)
• Modifies SGD by adapting the learning rate per parameter based on how
frequently it has been updated.
• Infrequently updated parameters get larger updates, frequently updated ones
get smaller updates.
• Works well for sparse data where some parameters rarely get gradients.
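A minimal NumPy sketch of the AdaGrad update under the standard formulation (the objective and hyperparameters are illustrative): each parameter accumulates the sum of its squared gradients, and its step is divided by the square root of that sum.

import numpy as np

def grad(w):
    # gradient of an elongated quadratic: small along w[0], large along w[1]
    return np.array([0.2 * w[0], 20.0 * w[1]])

w = np.array([10.0, 1.0])
G = np.zeros(2)                       # running sum of squared gradients, per parameter
eta, eps = 1.0, 1e-8

for step in range(500):
    g = grad(w)
    G += g ** 2                               # accumulate squared gradients (only grows)
    w -= eta * g / (np.sqrt(G) + eps)         # per-parameter adaptive step size

# Both coordinates move toward the minimum at (0, 0), but because G only grows,
# the effective learning rate keeps shrinking over time (the issue RMSProp addresses).
print("w after 500 steps:", np.round(w, 4))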
8. RMSProp (Root Mean Square Propagation)
• Improves AdaGrad by introducing an exponentially decaying moving average of
past squared gradients, preventing the learning rate from vanishing too quickly.
• This ensures that learning continues even after many updates.
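A minimal sketch under the standard formulation (objective and hyperparameters illustrative); the only change from AdaGrad is replacing the ever-growing sum with an exponentially decaying average of squared gradients:

import numpy as np

def grad(w):
    return np.array([0.2 * w[0], 20.0 * w[1]])

w = np.array([10.0, 1.0])
s = np.zeros(2)                       # decaying average of squared gradients
eta, rho, eps = 0.05, 0.9, 1e-8

for step in range(500):
    g = grad(w)
    s = rho * s + (1 - rho) * g ** 2          # decaying average instead of a raw sum
    w -= eta * g / (np.sqrt(s) + eps)         # effective step size no longer vanishes

print("w after 500 steps:", np.round(w, 4))   # both coordinates end up near 0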
9. Adam (Adaptive Moment Estimation)
• Combines momentum (for stable updates) and RMSProp (for adaptive learning
rates).
• Uses two moving averages:
• First moment estimate : captures the mean of past gradients (like
momentum).
• Second moment estimate : captures the variance of past gradients (like
RMSProp).
• Bias correction ensures unbiased estimates early in training.
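A minimal sketch of the standard Adam update (the objective is illustrative; the hyperparameters are the commonly used defaults):

import numpy as np

def grad(w):
    return np.array([0.2 * w[0], 20.0 * w[1]])

w = np.array([10.0, 1.0])
m = np.zeros(2)                       # first moment: decaying mean of gradients
v = np.zeros(2)                       # second moment: decaying mean of squared gradients
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g           # momentum-like average
    v = beta2 * v + (1 - beta2) * g ** 2      # RMSProp-like average
    m_hat = m / (1 - beta1 ** t)              # bias correction: moments start at zero,
    v_hat = v / (1 - beta2 ** t)              # so early estimates are scaled up
    w -= eta * m_hat / (np.sqrt(v_hat) + eps)

print("w after 500 steps:", np.round(w, 4))   # w heads toward the minimum at (0, 0)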
Regularization Techniques
Regularization techniques are used to prevent overfitting by adding penalties
to the model's complexity, making it generalize better to new, unseen data.
Key Points:
• Prevents Overfitting: Reduces the model's tendency to memorize noise in
the training data.
• Controls Complexity: Limits the model's complexity by penalizing large or
unnecessary parameters.
• Improves Generalization: Helps the model perform well on unseen data.
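As one concrete example (the per-technique slides are not reproduced here), a minimal sketch of L2 regularization on linear regression: the L2 penalty (lambda times the squared norm of the weights) adds 2*lambda*w to the gradient, so every update also shrinks the weights toward zero (weight decay). The synthetic data and the value of lambda are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.0]) + 0.5 * rng.normal(size=200)

def fit(lam):
    w = np.zeros(5)
    eta = 0.1
    for _ in range(500):
        grad = X.T @ (X @ w - y) / len(X) + 2 * lam * w   # data gradient + L2 penalty gradient
        w -= eta * grad                                   # equivalent to decaying w each step
    return w

print("no regularization:", np.round(fit(0.0), 3))
print("L2, lambda = 0.1: ", np.round(fit(0.1), 3))        # weights pulled toward zero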
Previous Year Questions
1. Elucidate the features of deep feed forward networks. (Slide 2)
2. Explain Nesterov Accelerated Gradient Descent with equations. (Slide 21-23)
3. Compare batch gradient descent and stochastic gradient descent. Describe the
advantages and limitations of each.
4. Compare RMSProp with AdaGrad. (Slide 24-27)
5. Why is proper parameter initialization crucial for training deep networks?
10. A supervised learning problem is given to model a deep feed forward neural
network. Suggest solutions for a small sized dataset for training.
11. Explain how L2 regularization improves the performance of deep feed forward
neural networks.
12. Differentiate stochastic gradient descent with and without momentum. Give
equations for the weight update in SGD with and without momentum.
13. State how to apply early stopping in the context of learning using Gradient
Descent.
14. Why is it necessary to use a validation set (instead of simply using the test set)
when using early stopping?
15. Describe the effect on bias and variance when a neural network is modified with
a larger number of hidden units followed by dropout regularization.
16. Describe the advantage of using the Adam optimizer instead of basic gradient descent.
Course Level Assessment Questions
1. Derive a mathematical expression to show L2 regularization as weight decay. Explain
how L2 regularization improves the performance of deep feed forward neural
networks.
2. In stochastic gradient descent, each pass over the dataset requires the same
number of arithmetic operations, whether we use minibatches of size 1 or size 1000.
Why can it nevertheless be more computationally efficient to use minibatches of size
1000?
3. State how to apply early stopping in the context of learning using Gradient
Descent. Why is it necessary to use a validation set (instead of simply using the test
set) when using early stopping?
4. Suppose that a model does well on the training set, but only achieves an accuracy
of 85% on the validation set. You conclude that the model is overfitting, and plan to
use L1 or L2 regularization to fix the issue. However, you learn that some of the
examples in the data may be incorrectly labeled. Which form of regularisation would
you prefer to use and why?
5. Describe one advantage of using Adam optimizer instead of basic gradient descent.
Model Questions
1. Derive the weight update rule in gradient descent when the error function is
a) mean squared error, b) cross entropy.
2. Discuss methods to prevent overfitting in neural networks.
3. Differentiate gradient descent with and without momentum. Give equations for
the weight update in GD with and without momentum. Illustrate plateaus, saddle
points and slowly varying gradients.
4. Suppose a supervised learning problem is given to model a deep feed forward
neural network. Suggest solutions for the following: a) small sized dataset for training,
b) dataset with unlabeled data, c) large dataset but data from a different distribution.
5. Describe the effect on bias and variance when a neural network is modified with
a larger number of hidden units followed by dropout regularization.