
Module 2

The document provides an overview of deep learning, focusing on deep feedforward networks, training processes, and various optimization techniques such as Gradient Descent and its variants. It discusses regularization methods to prevent overfitting and improve model generalization. Additionally, it includes a series of assessment questions related to the concepts covered in the module.

Uploaded by

akshaylalsp6

Module 2

Deep Learning
Introduction to deep learning, Deep feed forward network, Training deep models, Optimization
techniques - Gradient Descent (GD), GD with momentum, Nesterov accelerated GD, Stochastic
GD, AdaGrad, RMSProp, Adam. Regularization Techniques - L1 and L2 regularization, Early
stopping, Dataset augmentation, Parameter sharing and tying, Injecting noise at input,
Ensemble methods, Dropout, Parameter initialization.

Reena Thomas, Asst. Prof., CSE dept., CEMP 1


• Deep Learning is a subset of Machine Learning that uses neural networks with
many layers (hence "deep") to model complex patterns in large amounts of data.
It is loosely inspired by the structure and function of the human brain and lets
computers learn from data instead of following explicitly programmed rules.

• A Deep Feedforward Network is a type of neural network where information
moves in one direction, from input to output, through multiple layers of
neurons (a multi-layer neural network), without any loops or cycles.

Features of Deep Feedforward Network


• Feedforward: Data moves forward through the network, layer by layer, without
feedback connections.
• Deep: The network has multiple hidden layers between input and output layers.
• Fully connected: Each neuron in a layer is connected to every neuron in the
next layer.
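
The three features above can be illustrated with a minimal sketch of a forward pass. The layer sizes, weights, and the ReLU activation are assumptions chosen for illustration; they do not come from the slides.

```python
def relu(values):
    # ReLU activation (an assumed choice): negative values become zero.
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    # Fully connected layer: every neuron (one row of weights) sees every input.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(inputs, layers):
    # Feedforward: data flows layer by layer, with no feedback connections.
    activation = inputs
    for index, (weights, biases) in enumerate(layers):
        activation = dense(activation, weights, biases)
        if index < len(layers) - 1:      # hidden layers get a nonlinearity
            activation = relu(activation)
    return activation

# Deep: two hidden layers between the input and the output layer.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),   # hidden layer 1 (2 -> 2)
    ([[1.0, 1.0], [-1.0, 1.0]], [0.0, 1.0]),   # hidden layer 2 (2 -> 2)
    ([[1.0, 2.0]], [0.5]),                     # output layer   (2 -> 1)
]
output = forward([2.0, 1.0], layers)
print(output)
```

Each `dense` call is one fully connected layer, and stacking several of them in `layers` is what makes the network deep.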

2
Steps for Training Deep Models
1. Data Collection
Gather a large, diverse dataset suitable for the task (images, text, etc.).

2. Data Preprocessing
Clean the data (remove noise, handle missing values).
Normalize or scale the data to ensure it is in a consistent range.

3. Model Design
Choose an appropriate architecture (e.g., CNN for images, RNN for sequences).
Decide the number of layers, neurons, and activation functions.

4. Split Data
Divide the data into training, validation, and test sets.

5. Initialization
Initialize the model weights properly to avoid vanishing or exploding gradients.

6. Forward Pass
Pass the input data through the network, layer by layer, to calculate the output.
3
7. Loss Calculation
Compute the loss (difference between predicted output and actual output) using a loss
function.

8. Backward Pass (Backpropagation)
Compute gradients of the loss with respect to model parameters using the chain rule.
Update weights to minimize the loss using gradient descent or an optimizer like Adam.

9. Model Evaluation
Evaluate model performance on the validation set after each epoch to tune
hyperparameters and avoid overfitting.

10. Iterate and Tune
Train for multiple epochs (full passes over the training set).
Adjust hyperparameters like learning rate, batch size, and regularization based on
validation performance.

11. Final Evaluation
Once training is complete, evaluate the model's performance on the test set to check
its generalization ability.
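
Steps 4 to 11 can be sketched end to end on a toy problem. This is a minimal illustration, assuming a single linear unit, a synthetic dataset on the line y = 2x + 1, mean squared error, and plain gradient descent; a real deep model would use a framework and random weight initialization.

```python
# Hypothetical toy dataset: 20 points on the line y = 2x + 1.
data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(20)]

# Step 4: split into training, validation and test sets (60/20/20).
train = [d for i, d in enumerate(data) if i % 5 < 3]
val = [d for i, d in enumerate(data) if i % 5 == 3]
test = [d for i, d in enumerate(data) if i % 5 == 4]

def predict(w, b, x):
    return w * x + b

def mse(w, b, samples):
    return sum((predict(w, b, x) - y) ** 2 for x, y in samples) / len(samples)

# Step 5: initialization. Zeros are acceptable only because this is a single
# linear unit; multi-layer networks need random initialization.
w, b = 0.0, 0.0
learning_rate = 0.1
val_history = []

for epoch in range(2000):                      # Step 10: many epochs
    # Steps 6-7: forward pass and per-sample errors for the MSE loss.
    errors = [predict(w, b, x) - y for x, y in train]
    # Step 8: gradients of the MSE with respect to w and b, then the update.
    grad_w = sum(2.0 * e * x for e, (x, _) in zip(errors, train)) / len(train)
    grad_b = sum(2.0 * e for e in errors) / len(train)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    # Step 9: track validation loss after each epoch.
    val_history.append(mse(w, b, val))

# Step 11: final evaluation on the held-out test set.
test_loss = mse(w, b, test)
print(w, b, test_loss)
```

The test set is touched only once, after training, so `test_loss` is an honest estimate of generalization.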

4
Optimization Techniques in Deep
Learning

5
• Optimization techniques in deep learning refer to mathematical and algorithmic
methods used to minimize (or maximize) an objective function, typically the loss
function.
• The goal of optimization is to adjust the model's parameters (weights and biases)
to improve its performance on a given task.
• Gradient-Based Optimization techniques rely on computing gradients of the loss
function to update the model parameters.
• Gradient Descent (GD), by how much data each update uses:
  • Batch Gradient Descent (BGD)
  • Mini-Batch Gradient Descent
  • Stochastic Gradient Descent (SGD)
• Variants of SGD:
  • Momentum-based: Momentum, Nesterov Accelerated Gradient (NAG)
  • Adaptive learning rate: AdaGrad, RMSProp
  • Both combined: Adam (Adaptive Moment Estimation)
6
Gradient Descent

• Gradient Descent is an optimization technique used to find a local minimum of
a differentiable function.
• The weights (or parameters) are initialized using specific strategies and updated
iteratively using an update equation.

• This process continues until the objective function converges to a minimum
value (not necessarily zero).

• Graphically, this means finding the lowest point on the function curve.
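
The update equation is usually written as follows. The symbols are notational assumptions, not taken from the slides: w_t is the parameter vector at step t, η is the learning rate, and L is the loss.

```latex
w_{t+1} = w_t - \eta \, \nabla_w L(w_t)
```

The minus sign is what makes each step move downhill: the gradient points toward steepest ascent, so the update goes the opposite way.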

7
• For a function of one variable, the gradient (slope) is positive just to the
right of a minimum and negative just to the left.
• The slope is zero at the minimum itself, which makes it a critical point.
• The Gradient Descent (GD) algorithm starts from a random point and moves
downhill to reach the minimum.
• The gradient represents the direction and rate of steepest ascent; we move in
the opposite direction to minimize the function.

8
• The learning rate controls the step size—if too high, the algorithm may
overshoot; if too low, convergence is slow.

• GD works well for convex functions, where any local minimum is also the
global minimum.
• On non-convex functions it may get stuck in local minima or stall at saddle
points.
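
The effect of the learning rate can be seen in a small sketch. It assumes the convex function f(w) = (w - 3)^2, whose minimum is at w = 3; the function and the three learning rates are chosen only for illustration.

```python
def gradient_descent(grad, start, learning_rate, steps):
    # Repeatedly step against the gradient.
    w = start
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

def grad(w):
    # Derivative of the assumed function f(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

good = gradient_descent(grad, start=0.0, learning_rate=0.1, steps=100)
slow = gradient_descent(grad, start=0.0, learning_rate=0.001, steps=100)
diverged = gradient_descent(grad, start=0.0, learning_rate=1.1, steps=100)

print(good)      # very close to 3
print(slow)      # still far from 3 after 100 steps
print(diverged)  # overshoots more on every step and blows up
```

For this function each step multiplies the distance to the minimum by (1 - 2 × learning rate), so any rate above 1.0 makes the iterates grow instead of shrink.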

9
1. Batch Gradient Descent (BGD)

• Uses the entire dataset to compute the gradient of the loss function.
• Updates model parameters only after evaluating all training samples.
Advantages:
• Produces a stable and smooth convergence.
• Moves steadily toward the optimal solution.
Disadvantages:
• Computationally expensive for large datasets.
• Requires significant memory and processing power.
• Slower updates since it waits for the entire dataset before making a move.
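
A minimal sketch of batch gradient descent, assuming a one-parameter model y = w·x, a four-sample toy dataset, and mean squared error. Note that there is exactly one parameter update per pass over the data.

```python
# Hypothetical dataset generated by y = 3x.
samples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]

def full_gradient(w, samples):
    # Gradient of the mean squared error, averaged over ALL samples.
    return sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)

w = 0.0
learning_rate = 0.05
updates = 0
for epoch in range(50):
    w -= learning_rate * full_gradient(w, samples)   # one update per epoch
    updates += 1

print(w, updates)
```

Every call to `full_gradient` touches the whole dataset, which is why the method becomes expensive as the dataset grows.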

10
2. Mini-Batch Gradient Descent

• Splits the dataset into smaller batches.
• Computes the gradient and updates weights using each mini-batch.
Advantages:
• Faster than batch gradient descent.
• More stable than stochastic gradient descent.
• Vectorized operations let GPUs (Graphics Processing Units) and TPUs (Tensor
Processing Units) process many data points at once, making training faster and
more efficient.
Disadvantages:
• Still requires tuning the batch size for efficiency.
• Some noise in updates, but less than SGD.
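
A minimal sketch of mini-batch gradient descent under the same kind of assumptions: a one-parameter model y = w·x, eight synthetic samples, and a batch size of 2. The seed, learning rate, and batch size are illustrative choices.

```python
import random

# Hypothetical dataset generated by y = 3x.
samples = [(float(x), 3.0 * float(x)) for x in range(1, 9)]

def batch_gradient(w, batch):
    # Gradient of the mean squared error over one mini-batch only.
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

rng = random.Random(0)          # fixed seed so the run is repeatable
w = 0.0
learning_rate = 0.01
batch_size = 2
updates = 0

for epoch in range(200):
    rng.shuffle(samples)        # new batch composition every epoch
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        w -= learning_rate * batch_gradient(w, batch)
        updates += 1

print(w, updates)
```

With 8 samples and a batch size of 2 there are 4 updates per epoch, compared with 1 for batch gradient descent.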

11
3. Gradient Descent with Momentum
Gradient Descent with momentum is an optimization technique that helps
accelerate gradient descent by smoothing out updates and reducing oscillations.
Why Momentum?
• Standard Gradient Descent (GD) can be slow, especially when gradients oscillate
in different directions.
• Momentum helps GD move faster in the right direction by accumulating past
gradients and using them to update weights.
• This is especially useful in valleys or ravines where standard GD might zig-zag
slowly.

12
13
How Does Momentum Help?

• Reduces oscillations → Moves smoothly instead of bouncing back and forth.
• Speeds up convergence → Faster movement in the right direction.
• Can escape shallow local minima → The accumulated velocity helps carry the
update over small bumps in the loss landscape.
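
The speed-up in a ravine can be sketched on an assumed loss f(x, y) = 0.5·(x² + 50·y²), which is steep along y and flat along x. The velocity form used here (v ← β·v − η·g, then w ← w + v) is one common convention; the coefficient 0.9 and the learning rate are illustrative.

```python
def grad(point):
    # Gradient of the assumed ravine-shaped loss 0.5 * (x^2 + 50 * y^2).
    x, y = point
    return (x, 50.0 * y)

def loss(point):
    x, y = point
    return 0.5 * (x * x + 50.0 * y * y)

def run(momentum, learning_rate=0.03, steps=200):
    point = (10.0, 1.0)
    velocity = (0.0, 0.0)
    for _ in range(steps):
        g = grad(point)
        # Velocity accumulates a decaying sum of past gradient steps.
        velocity = tuple(momentum * v - learning_rate * gi
                         for v, gi in zip(velocity, g))
        point = tuple(p + v for p, v in zip(point, velocity))
    return loss(point)

plain_loss = run(momentum=0.0)      # momentum 0 is ordinary gradient descent
momentum_loss = run(momentum=0.9)
print(plain_loss, momentum_loss)
```

With the same learning rate and step count, the run with momentum ends at a much lower loss because the velocity keeps building along the flat x direction.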

14
4. Stochastic Gradient Descent (SGD)

• It is an optimization algorithm used to minimize the loss function in machine learning
models.
• It is commonly used for training neural networks and other machine learning models.
• Instead of computing gradients using the entire dataset (as in Batch Gradient Descent),
SGD updates parameters using only one data point (or a small batch) at a time.
• The model’s parameters (weights) are updated in the direction that reduces the loss.
• This approach makes SGD faster and more memory-efficient, especially for large
datasets.
• However, it introduces randomness (stochasticity) in updates, which can help escape
local minima but also causes more noise in convergence.
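
A minimal sketch of SGD with one sample per update, again assuming a one-parameter model y = w·x on synthetic data. The seed and learning rate are illustrative choices.

```python
import random

# Hypothetical dataset generated by y = 3x, with x from 0.25 to 2.0.
samples = [(x / 4.0, 3.0 * (x / 4.0)) for x in range(1, 9)]

rng = random.Random(0)          # fixed seed so the run is repeatable
w = 0.0
learning_rate = 0.1
updates = 0

for epoch in range(100):
    rng.shuffle(samples)        # visit the samples in random order
    for x, y in samples:
        # Gradient of the squared error on this single sample.
        sample_grad = 2.0 * (w * x - y) * x
        w -= learning_rate * sample_grad
        updates += 1

print(w, updates)
```

Each update needs only one sample in memory, and there are as many updates per epoch as there are samples. On noisy real data the individual steps would point in somewhat different directions, which is the source of the noisy convergence described above.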

15
16
17
5. Stochastic Gradient Descent (SGD) with
Momentum
• Momentum accelerates SGD by incorporating a fraction of the previous
update into the current update.
• Smooths out updates and prevents oscillations, especially in ravines (steep in
some directions, flat in others).
• Remembers past update direction, reducing reliance on only the current
gradient.
• Prevents zig-zagging, leading to faster convergence, especially in high-
dimensional spaces.
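
The two update rules can be written side by side. The notation is an assumption: η is the learning rate, β is the momentum coefficient (often around 0.9), v is the velocity, and L_i is the loss on the sample or mini-batch drawn at step t. Other sign and scaling conventions for v exist.

```latex
% SGD without momentum
w_{t+1} = w_t - \eta \, \nabla_w L_i(w_t)

% SGD with momentum
v_{t+1} = \beta \, v_t - \eta \, \nabla_w L_i(w_t)
w_{t+1} = w_t + v_{t+1}
```

Setting β = 0 in the second pair recovers plain SGD, so momentum is a strict generalization.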

18
19
• In a contour plot of the loss, each contour line connects points of equal loss
(or cost); the closer you get to the center (red dot), the lower the loss.
• The red dot represents the global minimum, where the optimization should ideally
converge.

20
6. Nesterov Accelerated Gradient (NAG)
• Standard momentum-based gradient descent updates parameters using a velocity
term to smooth out updates.
• NAG improves upon this by calculating the gradient at the "look-ahead" position
instead of the current position, reducing oscillations and overshooting issues.
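
The look-ahead step can be sketched as follows, assuming the ravine-shaped loss f(x, y) = 0.5·(x² + 50·y²) and one common formulation of NAG. The learning rate and momentum coefficient are illustrative.

```python
def grad(point):
    # Gradient of the assumed loss 0.5 * (x^2 + 50 * y^2).
    x, y = point
    return (x, 50.0 * y)

def loss(point):
    x, y = point
    return 0.5 * (x * x + 50.0 * y * y)

def nesterov(start, learning_rate=0.015, momentum=0.9, steps=200):
    point = start
    velocity = (0.0, 0.0)
    for _ in range(steps):
        # Look ahead: where the current velocity alone would carry us.
        lookahead = tuple(p + momentum * v for p, v in zip(point, velocity))
        # Evaluate the gradient there, not at the current position.
        g = grad(lookahead)
        velocity = tuple(momentum * v - learning_rate * gi
                         for v, gi in zip(velocity, g))
        point = tuple(p + v for p, v in zip(point, velocity))
    return point

final_point = nesterov((10.0, 1.0))
final_loss = loss(final_point)
print(final_point, final_loss)
```

The only difference from classical momentum is the argument passed to `grad`: the look-ahead point instead of the current point.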

21
22
23
7. Adaptive Gradient (AdaGrad)
• Modifies SGD by adapting the learning rate per parameter, scaling it by the
inverse square root of that parameter's accumulated squared gradients.
• Parameters with small or infrequent gradients get larger updates; parameters
with large or frequent gradients get smaller updates.
• Works well for sparse data where some parameters rarely get gradients.
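
The per-parameter behaviour can be sketched with two parameters, one that receives a gradient on every step and one that receives a gradient only on every tenth step. The gradient stream, learning rate, and epsilon are assumptions for illustration.

```python
import math

def adagrad_step(params, grads, accum, learning_rate=0.1, eps=1e-8):
    for i, g in enumerate(grads):
        accum[i] += g * g      # running sum of squared gradients, never decays
        params[i] -= learning_rate * g / (math.sqrt(accum[i]) + eps)

params = [0.0, 0.0]
accum = [0.0, 0.0]
for step in range(1, 101):
    # Parameter 0 is "dense"; parameter 1 is "sparse" (gradient every 10th step).
    grads = [1.0, 1.0 if step % 10 == 0 else 0.0]
    adagrad_step(params, grads, accum)

# Effective learning rate each parameter would see on its next update.
effective_lr = [0.1 / (math.sqrt(a) + 1e-8) for a in accum]
print(accum, effective_lr)
```

After 100 steps the sparse parameter has accumulated far less squared gradient, so its effective learning rate is roughly three times larger than that of the dense parameter.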

24
25
8. RMSProp (Root Mean Square Propagation)
• Improves AdaGrad by introducing an exponentially decaying moving average of
past squared gradients, preventing the learning rate from vanishing too quickly.
• This ensures that learning continues even after many updates.
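
The difference from AdaGrad can be sketched by feeding both methods the same constant gradient and comparing the size of the step each would take. The decay rate 0.9, learning rate, and epsilon are illustrative assumptions.

```python
import math

def adagrad_step_size(num_steps, g=1.0, learning_rate=0.1, eps=1e-8):
    accum = 0.0
    for _ in range(num_steps):
        accum += g * g                      # sum keeps growing forever
    return learning_rate * abs(g) / (math.sqrt(accum) + eps)

def rmsprop_step_size(num_steps, g=1.0, learning_rate=0.1, decay=0.9, eps=1e-8):
    avg = 0.0
    for _ in range(num_steps):
        # Exponentially decaying average: old squared gradients fade out.
        avg = decay * avg + (1.0 - decay) * g * g
    return learning_rate * abs(g) / (math.sqrt(avg) + eps)

adagrad_size = adagrad_step_size(1000)
rmsprop_size = rmsprop_step_size(1000)
print(adagrad_size, rmsprop_size)
```

After 1000 identical gradients the AdaGrad step has shrunk by a factor of about 30, while the RMSProp step stays close to the base learning rate.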

26
27
9. Adam (Adaptive Moment Estimation)
• Combines momentum (for stable updates) and RMSProp (for adaptive learning
rates).
• Uses two moving averages:
• First moment estimate: an exponential moving average of past gradients
(like momentum).
• Second moment estimate: an exponential moving average of past squared
gradients, i.e. their uncentered variance (like RMSProp).
• Bias correction ensures unbiased estimates early in training.
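
The pieces above can be put together in a minimal single-parameter sketch. It assumes the commonly used defaults β1 = 0.9, β2 = 0.999, ε = 1e-8, and minimizes the illustrative function f(w) = (w - 3)².

```python
import math

def adam_minimize(grad, start, steps, learning_rate=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    w = start
    m = 0.0    # first moment: moving average of gradients
    v = 0.0    # second moment: moving average of squared gradients
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction: m and v start at zero, so early values are too small.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w

def grad(w):
    # Derivative of the assumed function f(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

first_step = adam_minimize(grad, start=0.0, steps=1)
w_final = adam_minimize(grad, start=0.0, steps=1000)
print(first_step, w_final)
```

Thanks to bias correction, the very first step has a magnitude close to the learning rate regardless of how large the raw gradient is.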

28
29
30
Regularization techniques
Regularization techniques are used to prevent overfitting by penalizing or
otherwise constraining the model's complexity, making it generalize better to
new, unseen data.

Key Points:
• Prevents Overfitting: Reduces the model's tendency to memorize noise in
the training data.
• Controls Complexity: Limits the model's complexity by penalizing large or
unnecessary parameters.
• Improves Generalization: Helps the model perform well on unseen data.
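
As one concrete case of penalizing large parameters, the L2 penalty can be written out and shown to act as weight decay. The notation is an assumption: L is the unregularized loss, λ is the regularization strength, and η is the learning rate.

```latex
\begin{aligned}
\tilde{L}(w) &= L(w) + \frac{\lambda}{2}\,\lVert w \rVert_2^2 \\
\nabla_w \tilde{L}(w) &= \nabla_w L(w) + \lambda w \\
w_{t+1} &= w_t - \eta\,\bigl(\nabla_w L(w_t) + \lambda w_t\bigr)
         = (1 - \eta\lambda)\,w_t - \eta\,\nabla_w L(w_t)
\end{aligned}
```

The factor (1 - ηλ) shrinks every weight slightly on each step before the usual gradient update is applied, which is why L2 regularization is also called weight decay.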

31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
Previous Year Questions

49
1. Elucidate the features of deep feed forward networks. (Slide 2)
2. Explain Nesterov Accelerated Gradient Descent with equations. (Slide 21-23)
3. Compare batch gradient descent and stochastic gradient descent. Describe the
advantages and limitations of each.

50
4. Compare RMSProp with AdaGrad. (Slide 24-27)
5. Why is proper parameter initialization crucial for training deep networks?
6. Explain different parameter initialization techniques. (Slide 41, 42)
7. Compare L1 and L2 regularization. (Slide 33-36)
8. Explain different gradient descent optimization strategies used in deep learning.
(Slide 7-30)
9. Explain the following.
i) Early stopping (Slide 37, 38) ii) Dropout (Slide 43-44)
iii) Injecting noise at input (Slide 47) iv) Parameter sharing and tying (Slide 48)

51
10. A supervised learning problem is given to model a deep feed forward neural
network. Suggest solutions for a small sized dataset for training.

11. Explain how L2 regularization improves the performance of deep feed forward
neural networks.

52
12. Differentiate stochastic gradient descent with and without momentum. Give
equations for weight updation in SGD with and without momentum.

13. State how to apply early stopping in the context of learning using Gradient
Descent.

53
14. Why is it necessary to use a validation set (instead of simply using the test set)
when using early stopping?

15. Describe the effect in bias and variance when a neural network is modified with
more number of hidden units followed with dropout regularization.

54
16. Describe the advantage of using Adam optimizer instead of basic gradient descent

55
56
Course Level Assessment Questions

57
1. Derive a mathematical expression to show L2 regularization as weight decay. Explain
how L2 regularization improves the performance of deep feed forward neural
networks.

58
2. In stochastic gradient descent, each pass over the dataset requires the same
number of arithmetic operations, whether we use minibatches of size 1 or size 1000.
Why can it nevertheless be more computationally efficient to use minibatches of size
1000?

59
3. State how to apply early stopping in the context of learning using Gradient
Descent. Why is it necessary to use a validation set (instead of simply using the test
set) when using early stopping?

60
4. Suppose that a model does well on the training set, but only achieves an accuracy
of 85% on the validation set. You conclude that the model is overfitting, and plan to
use L1 or L2 regularization to fix the issue. However, you learn that some of the
examples in the data may be incorrectly labeled. Which form of regularisation would
you prefer to use and why?

61
5. Describe one advantage of using Adam optimizer instead of basic gradient descent.

62
Model Questions

63
1. Derive weight updating rule in gradient descent when the error function is
a) mean squared error b) cross entropy.
2. Discuss methods to prevent overfitting in neural networks.

64
3. Differentiate gradient descent with and without momentum. Give equations for
weight updation in GD with and without momentum. Illustrate plateaus, saddle
points and slowly varying gradient.

65
4. Suppose a supervised learning problem is given to model a deep feed forward
neural network. Suggest solutions for the following a) small sized dataset for training
b) dataset with unlabeled data c) large data set but data from different distribution.

66
5. Describe the effect in bias and variance when a neural network is modified with
more number of hidden units followed with dropout regularization

67
