
ADL: Unit-3

IMPROVED OPTIMIZATION

Newer Optimization Algorithms for Training Neural Networks


Optimizers are algorithms or methods used to adjust the attributes of a neural network, such as its weights and learning rate, in order to reduce the loss.

1. Adagrad

Adagrad, short for Adaptive Gradient Algorithm, is an optimization algorithm used for training neural networks. Here's a deeper dive into how it works:

Core Idea: Unlike standard optimizers with a constant learning rate for all parameters, Adagrad adapts the learning rate individually for each parameter based on its historical gradients.

How it Works:

1. Accumulate Squared Gradients: Adagrad keeps track of the sum of squared gradients for each parameter over time.
2. Adaptive Learning Rate: This accumulated sum is used to compute an adaptive learning rate for each parameter during each update. Higher accumulated squared gradients (indicating frequent updates) result in lower learning rates, preventing those parameters from overshooting the optimal value.

The gradient of the loss function for a given parameter θ(i) at time step t is

g(t,i) = ∂J(θ(i))/∂θ(i)

The parameter update for a given parameter i at time/iteration t is

θ(t+1,i) = θ(t,i) − η/√(G(t,i) + ϵ) · g(t,i)

Here η is a learning rate which is modified for the given parameter θ(i) at time t based on the previous gradients calculated for θ(i): G(t,i) stores the sum of the squares of the gradients w.r.t. θ(i) up to time step t, while ϵ is a smoothing term that avoids division by zero (usually on the order of 1e−8). Interestingly, without the square-root operation the algorithm performs much worse.

As a result, Adagrad makes big updates for infrequently updated parameters and small updates for frequently updated parameters.
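
Here is a minimal NumPy sketch of the update rule above on a toy quadratic loss (the loss, variable names, and step count are invented for the example, not part of the original notes):

```python
import numpy as np

# Toy example: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is simply theta.
theta = np.array([1.0, -2.0])
G = np.zeros_like(theta)          # running sum of squared gradients, one entry per parameter
lr, eps = 0.1, 1e-8

for step in range(100):
    grad = theta                  # dJ/dtheta for this toy loss
    G += grad ** 2                # accumulate squared gradients
    theta -= lr * grad / (np.sqrt(G) + eps)   # per-parameter adaptive step
```

Parameters with a large accumulated G receive smaller effective steps, which is exactly the behaviour described above.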

Advantages:

1. The learning rate adapts individually for each trainable parameter.
2. No need to manually tune the learning rate.
3. Able to train on sparse data.

Disadvantages:

1. Computationally expensive, since a squared-gradient accumulator must be stored and updated for every parameter.
2. The learning rate keeps decreasing, which eventually slows training to a crawl.

Overall:

Adagrad is a powerful optimizer for problems with sparse data, but be aware of the potential for diminishing learning rates in later training stages. Consider AdaDelta or RMSProp if this becomes a concern. It's always a good practice to experiment with different optimizers to find the best fit for your specific neural network application.

2. AdaDelta

AdaDelta, short for Adaptive Delta, is an optimization algorithm designed to address a specific limitation of its predecessor, Adagrad, in training neural networks.

The Problem with Adagrad:

Adagrad adapts the learning rate for each parameter based on the history
of its squared gradients. This is great for sparse data, but it can lead to a
problem called diminishing learning rates. As training progresses,
the sum of squared gradients keeps growing, causing the learning rates
to constantly decrease. Eventually, the learning rates become too small,
hindering further improvement and stalling the training process.

To fix this, an exponentially decaying moving average of squared gradients is used rather than the sum of all past gradients:

E[g²](t) = γ·E[g²](t−1) + (1−γ)·g²(t)

We set γ to a value similar to the momentum term, around 0.9. The parameters are then updated by dividing the learning rate by the root of this running average:

θ(t+1) = θ(t) − η/√(E[g²](t) + ϵ) · g(t)

How AdaDelta Improves:

AdaDelta builds on the idea of Adagrad but introduces a key difference in how it calculates the adaptive learning rate. Here's the approach:

1. Decaying Average: Instead of accumulating all historical squared gradients, AdaDelta uses a decaying average. This means it considers past gradients but gives more weight to recent updates.
2. Fixed Window Size: Unlike Adagrad, AdaDelta effectively restricts the accumulation to a fixed window of recent gradients via this decaying average. This prevents the accumulated term from growing indefinitely and ensures the learning rates don't become excessively small. (A code sketch follows this list.)
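
The update shown earlier still uses a global learning rate η; the full AdaDelta algorithm goes one step further and replaces η with an RMS of recent parameter updates, so no learning rate needs to be chosen at all. A minimal NumPy sketch of that full update (function and variable names are my own; ρ plays the role of γ above):

```python
import numpy as np

def adadelta_update(theta, grad, Eg2, Edx2, rho=0.9, eps=1e-6):
    """One AdaDelta step using decaying averages of squared gradients and squared updates."""
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2                    # E[g^2](t)
    delta = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad   # step scaled by RMS ratio, no lr
    Edx2 = rho * Edx2 + (1 - rho) * delta ** 2                 # E[delta_theta^2](t)
    return theta + delta, Eg2, Edx2

# Usage on a toy quadratic loss 0.5 * ||theta||^2 (gradient = theta)
theta = np.array([1.0, -2.0])
Eg2 = np.zeros_like(theta)
Edx2 = np.zeros_like(theta)
for _ in range(200):
    theta, Eg2, Edx2 = adadelta_update(theta, theta, Eg2, Edx2)
```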

Advantages:

1. The effective learning rate no longer decays toward zero, so training does not stall.

Disadvantages:

1. Computationally expensive.
3. RMSProp

RMSProp, which stands for Root Mean Square Propagation, is an optimization algorithm used for training neural networks. It addresses a challenge faced by Stochastic Gradient Descent (SGD) and aims to accelerate learning.

Addressing SGD's Limitation:

SGD can struggle with gradients that keep oscillating or decreasing for
certain weights. This can hinder learning progress in those directions.
RMSProp tackles this issue by adapting the learning rate for each weight
based on its recent squared gradients.

Core Idea of RMSProp:

1. Track Squared Gradients: RMSProp maintains an exponentially decaying average of squared gradients for each weight. This average reflects the recent history of how much the parameter has been updated.
2. Adaptive Learning Rate: During each update, the learning rate for a weight is scaled based on its current gradient and the exponentially decaying average of squared gradients.
RMSProp is in fact identical to the first update vector of AdaDelta that we derived above:

E[g²](t) = 0.9·E[g²](t−1) + 0.1·g²(t)

θ(t+1) = θ(t) − η/√(E[g²](t) + ϵ) · g(t)

RMSProp likewise divides the learning rate by an exponentially decaying average of squared gradients. Hinton suggests setting γ to 0.9, while a good default value for the learning rate η is 0.001.

RMSProp and AdaDelta were developed independently around the same time, both stemming from the need to resolve Adagrad's radically diminishing learning rates.
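
A minimal NumPy sketch of one RMSProp step using the defaults mentioned above (function and variable names are my own):

```python
import numpy as np

def rmsprop_update(theta, grad, Eg2, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSProp step: divide the learning rate by the RMS of recent gradients."""
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(Eg2) + eps)
    return theta, Eg2

# Usage on a toy quadratic loss 0.5 * ||theta||^2 (gradient = theta)
theta, Eg2 = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(500):
    theta, Eg2 = rmsprop_update(theta, theta, Eg2)
```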

Advantages:

1. Faster convergence.
2. Handles noisy gradients well.
3. Requires fewer hyperparameters to tune.

Disadvantages:

1. Can be sensitive to the learning rate.
2. Lacks momentum compared to some optimizers.

Overall, RMSProp is a powerful optimizer that can improve the efficiency and
stability of training compared to SGD. It's a good choice for various neural
network architectures, particularly when dealing with oscillating or noisy
gradients.

RMSProp is often seen as a simpler and more efficient alternative to Adam, another popular optimizer.

4. Adam

Adam (Adaptive Moment Estimation) is a popular optimization algorithm for training neural networks, combining the strengths of RMSProp and momentum to deliver efficient and stable learning.

Core Idea:

● Combines Strengths: Adam incorporates the benefits of both RMSProp (adaptive learning rates) and momentum (exploiting historical gradients) to address limitations of SGD.

● Adaptive Learning Rates: Similar to RMSProp, Adam maintains an exponentially decaying average of squared gradients for each parameter. This helps adjust learning rates based on individual parameter behavior.

● Momentum: Adam also keeps an exponentially decaying average of past gradients, mimicking momentum and allowing for smoother convergence, especially in situations with noisy or oscillating gradients.

Adam computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients v(t) like AdaDelta and RMSProp, Adam also keeps an exponentially decaying average of past gradients m(t), similar to momentum. Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface.

Hyper-parameters β1, β2 ∈ [0, 1) control the exponential decay rates of these moving averages. We compute the decaying averages of past and past squared gradients m(t) and v(t) respectively as follows:

m(t) = β1·m(t−1) + (1−β1)·g(t)

v(t) = β2·v(t−1) + (1−β2)·g²(t)

m(t) and v(t) are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, hence the name of the method.
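
The original Adam paper additionally corrects m(t) and v(t) for their bias toward zero during the first steps. A minimal NumPy sketch of one Adam step including that correction (function and variable names are my own):

```python
import numpy as np

def adam_update(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (t starts at 1); bias correction follows the original paper."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (uncentred variance)
    m_hat = m / (1 - beta1 ** t)                # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage on a toy quadratic loss 0.5 * ||theta||^2 (gradient = theta)
theta, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 501):
    theta, m, v = adam_update(theta, theta, m, v, t)
```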

Advantages:
● Speedy Convergence: Adam combines adaptive learning rates and
momentum, often leading to much faster training compared to vanilla SGD.
● Jack-of-All-Trades: It performs well across various neural network tasks,
making it a versatile choice for many applications.
● Less Tuning Hassle: Compared to some optimizers, Adam requires fewer
hyperparameters to fine-tune, simplifying the setup process.

Disadvantages:

● Not a Guaranteed Winner: While powerful, Adam isn't a magic bullet. Experimenting with other optimizers might be necessary for specific problems.
● Learning Rate Matters: Although it adapts learning rates, choosing a
good initial value can still impact Adam's performance.
● Theoretical Concerns: Some researchers raise concerns about the
theoretical convergence guarantees of Adam compared to other
optimizers.

5. Nesterov Accelerated Gradient (NAG)

Nesterov Accelerated Gradient (NAG) is a powerful optimization algorithm used for training neural networks. It improves upon the standard momentum method by considering a "lookahead" approach to the parameter update.

Here's a breakdown of NAG:

● Improves Momentum: It builds on the idea of momentum, which helps navigate through flat regions of the loss function.
● Looks Ahead: Unlike standard momentum, NAG takes a peek at the future parameter position based on the current momentum and then calculates the gradient at that point.
● Corrects Course: This "lookahead" allows NAG to make a more informed update based on the anticipated future location, potentially leading to faster convergence and reduced oscillations.


The idea of the NAG algorithm is very similar to SGD with momentum, with a slight variant. In SGD with momentum, both the momentum term and the gradient are computed at the current (previously updated) weights. Momentum is a good method, but if the momentum is too high the algorithm may overshoot the minimum and keep climbing. To resolve this issue, the NAG algorithm was developed as a look-ahead method. Since we know we'll be using γ·V(t−1) to modify the weights, θ − γ·V(t−1) approximately tells us the future location. We therefore calculate the cost based on this future parameter rather than the current one:

V(t) = γ·V(t−1) + α·∂J(θ − γ·V(t−1))/∂θ

and then update the parameters using θ = θ − V(t).

Again, we set the momentum term γ to a value of around 0.9. While momentum first computes the current gradient and then takes a big jump in the direction of the updated accumulated gradient, NAG first makes a big jump in the direction of the previously accumulated gradient, measures the gradient at that point, and then makes a correction, which together form the complete NAG update. This anticipatory update prevents us from going too fast and results in increased responsiveness, which has significantly increased the performance of RNNs on a number of tasks.
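
A minimal NumPy sketch of the look-ahead update above on a toy quadratic loss (function and variable names are my own):

```python
import numpy as np

def nag_update(theta, v, grad_fn, lr=0.01, gamma=0.9):
    """One NAG step: evaluate the gradient at the look-ahead point theta - gamma*v."""
    lookahead = theta - gamma * v
    v = gamma * v + lr * grad_fn(lookahead)   # V(t) = gamma*V(t-1) + alpha*grad at look-ahead
    return theta - v, v                       # theta = theta - V(t)

# Usage on a toy quadratic loss 0.5 * ||theta||^2 (gradient = theta)
theta, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    theta, v = nag_update(theta, v, grad_fn=lambda w: w)
```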


Advantages:
1. Faster convergence.
2. Handles uneven loss landscapes well.

Disadvantages:
1. May not always be the best choice.
2. Has some overlap with Adam's functionality.

SECOND ORDER METHODS FOR NEURAL NETWORKS

In deep learning, training usually involves optimizing a complex function (the loss function) to find the best set of parameters for your neural network. Most optimizers used today, like SGD and its variants, are first-order methods. They rely on the gradients (the first derivative) of the loss function to update the parameters.

Second-order optimization methods take a different approach. They additionally consider the curvature (represented by the second derivative) of the loss function to make update decisions. This can lead to several advantages:
● Faster Convergence: In theory, second-order methods can converge to the optimal solution much quicker than first-order methods like SGD.

However, there's a catch:

● Computational Cost: Calculating and utilizing the second derivative (Hessian) is computationally expensive, especially for large deep learning models with millions of parameters. This makes them less scalable.
● Sensitivity to Noise: Second-order methods can be sensitive to noise in the data or
gradients, potentially leading to poor performance.

Here are some specific types of second-order methods used in deep learning:

● Newton's Method: The most well-known second-order method, offering the fastest
theoretical convergence but also the most computationally expensive due to the
exact Hessian calculation.
● Quasi-Newton Methods: These methods approximate the Hessian using
information from past gradients, making them more scalable than Newton's Method
but potentially less accurate.
● Hessian-Free Methods: These avoid calculating the Hessian entirely, using
gradients and function values for updates. They offer better scalability but might have
slower convergence.

Overall, second-order methods are still an area of active research in deep learning.
While not always practical for large models due to computational limitations, they
offer valuable insights into the optimization landscape and can be useful for smaller
models or specific tasks where faster convergence is a priority.

In detail about the methods:

1. Newton's Method
Newton's method, named after Isaac Newton, is an optimization technique used in various
fields, including finding the roots of equations and training models in machine learning. In
deep learning, it's a second-order optimization method that utilizes the curvature of the loss
function to find the minimum faster, at least in theory.

Here's how it works for finding the root (x-intercept) of a function f(x) which means solving
the equation f(x) = 0:

1. Initial Guess: You start with an initial guess (x₀) of where the root might be. This can
be based on your knowledge of the function or pure intuition.
2. Calculate Slope and Line: Imagine the graph of the function f(x). At your initial
guess (x₀), Newton's method calculates the slope of the tangent line to the function at
that point. This slope is obtained using the first derivative of f(x), evaluated at x₀
(f'(x₀)).
3. Intersection with X-axis: The tangent line is then extended until it intersects the
x-axis. This intersection point (x₁) represents a better estimate of the root compared
to your initial guess.
4. Repeat: You use the newly found x₁ as your next guess and repeat the process
(calculate slope, extend tangent line, find new intersection). The idea is that with
each iteration, you get closer and closer to the actual root.
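
A minimal sketch of the four steps above for root finding (the example function, tolerance, and names are arbitrary choices for illustration):

```python
def newton_root(f, f_prime, x0, tol=1e-10, max_iter=50):
    """Find a root of f by repeatedly following the tangent line down to the x-axis."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)   # tangent at x crosses the x-axis at x - f(x)/f'(x)
        x -= step
        if abs(step) < tol:        # stop once the guesses barely change
            break
    return x

# Example: solve x^2 - 2 = 0 starting from x0 = 1 (converges to sqrt(2))
print(newton_root(lambda x: x ** 2 - 2, lambda x: 2 * x, x0=1.0))
```

In optimization the same idea is applied to the derivative of the loss, which is where the Hessian (second derivative) enters.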

Advantages of Newton's Method:

● Fast Convergence: In theory, Newton's method can converge to the root very
quickly, especially if your initial guess is close to the actual solution.

Disadvantages of Newton's Method:

● Computational Cost: Calculating the derivatives (especially the second derivative, which Newton's method uses) can be expensive, especially for the complex functions used in deep learning.
● Scalability Issues: Due to the high computational cost, Newton's method becomes
impractical for training large deep learning models.
● Sensitivity to Noise: If the function or gradients are noisy, Newton's method can get
stuck in saddle points (not the true minimum) or diverge completely.

In summary:

While Newton's method offers the potential for faster convergence, its computational cost
and sensitivity to noise make it less favorable for training large deep learning models.
However, it remains a valuable concept in understanding optimization techniques and can be
useful for smaller models or specific tasks where faster convergence is crucial.

2. Quasi-Newton Methods

Quasi-Newton methods are a group of optimization algorithms used in deep learning that address the limitations of Newton's method. While Newton's method offers the fastest convergence theoretically, its dependence on calculating the Hessian (the matrix of second derivatives) makes it computationally expensive and impractical for large models.

Quasi-Newton methods take a middle ground:

● Goal: Like Newton's method, they aim to find the minimum of a function (often the
loss function in deep learning).
● Approach: Instead of calculating the exact Hessian at every step, they approximate
it using information from past gradients. This makes them more scalable for larger
models compared to Newton's method.

Here's a breakdown of how Quasi-Newton methods work:

1. Similar to Gradient Descent: They start with an initial guess for the parameters and
iteratively update them in the direction that minimizes the loss function.
2. Leveraging Past Information: Unlike SGD (Stochastic Gradient Descent) which
only uses the current gradient, Quasi-Newton methods incorporate information from
past gradients to build an approximation of the Hessian.
3. Updating the Approximation: As the optimization progresses, the approximation of
the Hessian is refined based on new gradient information.
Popular Quasi-Newton Methods:

● L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno): This is a widely used variant that stores a limited history of gradients to approximate the Hessian, making it efficient for large models (see the sketch below).
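
As a rough illustration, SciPy's general-purpose optimizer exposes an L-BFGS-B implementation; the toy two-parameter loss below stands in for a real network's loss (this sketch assumes SciPy is installed and is not specific to any deep learning framework):

```python
import numpy as np
from scipy.optimize import minimize

# Toy "loss" with two parameters; L-BFGS builds its Hessian approximation from past gradients.
def loss(w):
    return (w[0] - 3.0) ** 2 + 10.0 * (w[1] + 1.0) ** 2

def grad(w):
    return np.array([2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)])

result = minimize(loss, x0=np.zeros(2), jac=grad, method="L-BFGS-B")
print(result.x)   # approximately [3, -1]
```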

Advantages of Quasi-Newton Methods:

● Faster Convergence: Compared to SGD, they can converge to the minimum point
of the loss function quicker, especially for well-behaved loss functions.
● More Scalable: Their use of approximated Hessians makes them more suitable for
training larger deep learning models compared to Newton's method.

Disadvantages of Quasi-Newton Methods:

● Still Computationally Expensive: While more efficient than Newton's method, they
are still more computationally demanding than SGD.
● Approximation Accuracy: The accuracy of the Hessian approximation can affect
the convergence rate. For complex loss functions, the approximation might not be
very accurate, hindering performance.

Overall, Quasi-Newton methods offer a valuable compromise between the speed of Newton's method and the computational efficiency of SGD. They are a good choice for training moderately large deep learning models where faster convergence is desired but computational resources are a constraint.

3. Hessian-Free Methods

Hessian-free methods are a category of optimization algorithms used in deep learning that aim to capture the benefits of second-order optimization (faster convergence) while avoiding the cost of directly calculating the Hessian matrix. As noted above, the Hessian, which represents the curvature of the loss function, can be computationally expensive to compute, especially for large deep learning models with millions of parameters.

Core Idea:

Hessian-free methods bypass the explicit calculation of the Hessian entirely. Instead,
they utilize information readily available during training, such as gradients and
function values, to update the model's parameters and guide them towards the
minimum of the loss function.

How it Works:

1. Similar Starting Point: These methods often begin with an initial guess for the
model parameters, similar to other optimization algorithms.
2. Leveraging Gradients: They utilize the gradients of the loss function with
respect to the parameters. The gradient indicates the direction of steepest
descent, providing valuable information about how to update the parameters to
minimize the loss.
3. Additional Information: Some Hessian-free methods might also incorporate
additional information beyond gradients, such as curvature information
obtained through techniques like finite differences, to improve the update
direction.
4. Iterative Updates: The parameters are then updated iteratively based on the
calculated direction and a chosen step size. As the optimization progresses,
the updates become more refined, gradually moving towards the minimum.
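
Building on step 3 above, one common trick in Hessian-free optimization is to approximate a Hessian-vector product H·v from two gradient evaluations, never forming the Hessian itself. A minimal sketch (the toy gradient function and names are invented for the example):

```python
import numpy as np

def hessian_vector_product(grad_fn, theta, v, eps=1e-5):
    """Approximate H @ v via central finite differences of the gradient (no explicit Hessian)."""
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)

# Example: for loss(w) = w0^2 + 3*w1^2 the Hessian is diag(2, 6)
grad_fn = lambda w: np.array([2 * w[0], 6 * w[1]])
print(hessian_vector_product(grad_fn, np.array([1.0, 1.0]), np.array([1.0, 0.0])))  # ~[2, 0]
```

Products like this can then be fed to an iterative solver (e.g. conjugate gradient) to approximate a Newton step without ever storing the full matrix.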

Popular Hessian-Free Methods:

● Krylov Subspace Methods: These methods use a mathematical framework called Krylov subspaces to efficiently navigate the optimization landscape using gradient information.
● Truncated Newton Methods: These methods perform limited calculations that approximate the curvature of the loss function, avoiding the full Hessian computation.

Advantages of Hessian-Free Methods:

● Scalability: By avoiding the Hessian calculation, they are significantly more scalable than exact second-order methods like Newton's method, making them suitable for training larger deep learning models.
● Less Sensitive to Noise: Compared to some second-order methods, they can be less sensitive to noise in the data or gradients, leading to more robust optimization.

Disadvantages of Hessian-Free Methods:

● Convergence Rate: While often faster than SGD, their convergence rate might
be slower than exact second-order methods due to the lack of a complete
Hessian picture.
● Hyperparameter Tuning: These methods might require careful tuning of
hyperparameters to achieve optimal performance.

Overall, Hessian-free methods offer a promising approach for training deep learning
models. They combine the efficiency of first-order methods (like SGD) with some of
the convergence benefits of second-order methods, making them a valuable tool for
various deep learning tasks.

SADDLE POINT PROBLEM

https://www.youtube.com/watch?v=ktxztPzQg6o&ab_channel=LearningMonkey

(The corresponding equations are worked through in the linked video.)

The saddle point problem is a significant challenge encountered when training neural networks. It occurs when the optimization algorithm gets stuck at a point in the loss function's landscape that is neither a minimum nor a maximum. Imagine the middle of a horse's saddle: the surface curves downward in one direction and upward in another, and the slope right at the centre is zero, so a gradient-based method sees no direction to move in even though it is not at a minimum.

Here's a breakdown of the saddle point problem in neural networks:

● Loss Function Landscape: When training a neural network, we aim to minimize a loss function that represents the model's performance. This loss function can be visualized as a 3D landscape with hills, valleys, and flat areas.
● The Culprit: Saddle Points: Saddle points appear as flat regions or ridges in this landscape. The gradient, which guides the optimization algorithm's movement, becomes (nearly) zero at these points, so the algorithm behaves as if it has converged even though it is not actually at the minimum (the ideal spot for good performance). A small numerical example follows this list.
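
A minimal numerical illustration of a saddle point, using the classic surface f(x, y) = x² − y² (the function and names are chosen for the example):

```python
import numpy as np

def loss(w):
    x, y = w
    return x ** 2 - y ** 2          # classic saddle: curves up along x, down along y

def grad(w):
    x, y = w
    return np.array([2 * x, -2 * y])

print(grad(np.zeros(2)))   # [0. 0.] -- zero gradient at (0, 0), yet it is not a minimum
```

A plain gradient step at (0, 0) moves nowhere, which is exactly why extra machinery such as momentum or adaptive learning rates is helpful.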

Consequences of Saddle Points:

● Slows Down Training: If the optimization algorithm gets stuck in a saddle point,
it can significantly slow down the training process, hindering the model's
ability to learn effectively.
● Suboptimal Performance: Even if the algorithm escapes a saddle point, it might
not reach the true minimum point, leading to suboptimal performance for the
neural network.

How to Deal with Saddle Points:

● Momentum: This technique incorporates the direction of previous updates, allowing the algorithm to have some inertia and potentially carry it through shallow saddle points.
● Adaptive Learning Rates: Optimizers like Adam and RMSProp adjust learning
rates for different parameters, which can help the algorithm navigate uneven
landscapes and avoid getting stuck in saddle points.
● Weight Initialization: Proper initialization of weights can influence the training
trajectory and make the model less susceptible to saddle points.
● Second-Order Optimization Methods: While computationally expensive, these
methods consider the curvature of the loss function and might provide faster
escape from saddle points (though not always practical for large models).

Overall, the saddle point problem is a roadblock in neural network training. By understanding its nature and employing techniques like momentum, adaptive learning rates, and careful weight initialization, we can improve the efficiency and effectiveness of training, leading to better performing neural networks.

Regularization Methods

Regularization is a set of techniques used in machine learning to prevent overfitting and improve the generalizability of models. Overfitting occurs when a model learns the training data too well, including the noise and irrelevant details, which hinders its ability to perform well on unseen data.

Here's how regularization methods work:

1. Penalizing Model Complexity: Regularization techniques introduce a penalty term into the loss function being optimized during training. This penalty term discourages the model from becoming too complex or fitting the training data too closely (a small example follows this list).
2. Finding the Balance: The goal is to find a balance between fitting the
training data well and keeping the model general enough to perform well
on unseen data. Regularization helps achieve this balance by introducing a
trade-off between reducing the training error and the penalty term.
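
As a small illustration of such a penalty term, here is a sketch of an L2 (weight-decay) penalty added to a data loss; L2 is just one common choice, and the function and variable names are made up for the example:

```python
import numpy as np

def l2_regularized_loss(data_loss, weights, lam=1e-4):
    """Total objective = data loss + lambda * sum of squared weights."""
    penalty = lam * sum(np.sum(w ** 2) for w in weights)
    return data_loss + penalty

# Example with two made-up weight matrices
weights = [np.ones((4, 3)), np.ones((3, 2))]
print(l2_regularized_loss(data_loss=0.42, weights=weights))
```

The hyperparameter lam controls the trade-off described above: larger values push the model toward simpler weight configurations.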

Methods:

1. Dropout

Dropout is a powerful regularization technique used in neural networks to prevent overfitting and improve model generalizability. Here's a breakdown of dropout and how it works:

The Overfitting Problem:

Overfitting occurs when a neural network model learns the training data too well,
including the noise and irrelevant details. This can lead to poor performance on
unseen data, the ultimate test of a model's effectiveness.

How Dropout Works:

1. Randomly Dropping Neurons: During training, dropout randomly deactivates (drops out) a certain percentage of neurons in each layer of the neural network, except for the input layer. This deactivation happens with a chosen probability (e.g., 20%) at each training step.

2. Forcing Redundancy: By randomly dropping neurons, dropout forces the network to learn features that are not dependent on any specific neuron. This encourages redundancy and robustness in the model's internal representation.

3. New Network Every Iteration: Since different neurons are dropped out at
each training step, the network effectively encounters a new architecture
during each iteration. This helps prevent the model from becoming overly
reliant on specific features or connections.
4. No Need for Specific Neuron Selection: Unlike L1 regularization (which
encourages sparsity by driving some weights to zero), dropout doesn't
explicitly select which neurons are important. The random deactivation
forces the network to learn robust representations across different subsets
of neurons.

Implementation:

Dropout is typically implemented during training only, not during inference (using
the model for prediction). Libraries like TensorFlow and PyTorch offer dropout
layers that can be easily integrated into your neural network architecture.
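
As a quick sketch of how such a dropout layer is used in PyTorch (the layer sizes and the 20% rate are arbitrary choices for the example):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.2),    # randomly zeroes 20% of activations, training mode only
    nn.Linear(256, 10),
)

x = torch.randn(32, 784)

model.train()             # dropout is active during training
out_train = model(x)

model.eval()              # dropout is disabled at inference time
out_eval = model(x)
```

Switching between train() and eval() is what enforces the "training only" behaviour described above.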

Choosing the Dropout Rate:

The optimal dropout rate (the percentage of neurons to drop) can vary depending
on the dataset and network architecture. Experimentation is often necessary to
find the best value for your specific model.

Overall, dropout is a simple yet effective regularization technique that plays a vital role in training robust and generalizable neural networks. Its ease of implementation and ability to address overfitting make it a popular choice for deep learning practitioners.

Advantages of Dropout Regularization in Deep Learning


● Prevents Overfitting: By randomly disabling neurons, the network
cannot overly rely on the specific connections between them.
● Ensemble Effect: Dropout acts like training an ensemble of smaller
neural networks with varying structures during each iteration. This
ensemble effect improves the model’s ability to generalize to unseen
data.
● Enhancing Data Representation: Dropout methods are used to
enhance data representation by introducing noise, generating
additional training samples, and improving the effectiveness of the
model during training.

Drawbacks of Dropout Regularization and How to Mitigate Them
1. Longer Training Times: Dropout increases training duration due to
random dropout of units in hidden layers. To address this, consider
powerful computing resources or parallelize training where possible.
2. Optimization Complexity: Exactly why dropout works so well is not fully understood, which makes tuning it less systematic. Experiment with dropout rates on a smaller scale before full implementation to fine-tune model performance.
3. Hyperparameter Tuning: Dropout adds hyperparameters like dropout
chance and learning rate, requiring careful tuning. Use techniques such
as grid search or random search to systematically find optimal
combinations.
4. Redundancy with Batch Normalization: Batch normalization can
sometimes replace dropout effects. Evaluate model performance with
and without dropout when using batch normalization to determine its
necessity.
5. Model Complexity: Dropout layers add complexity. Simplify the model
architecture where possible, ensuring each dropout layer is justified by
performance gains in validation.

2. DropConnect

DropConnect is a regularization technique for neural networks that builds upon the concept
of Dropout, offering an alternative approach to preventing overfitting. Here's how it compares
to Dropout:

Dropout vs. DropConnect:

● Dropout: Randomly deactivates (drops out) a certain percentage of neurons in each layer during training. This forces the network to learn features that are not dependent on any specific neuron.
● DropConnect: Randomly sets to zero a certain percentage of weights (connections) in each layer during training. This encourages the network to develop robust features by relying on a smaller subset of active connections at each training iteration. (A code sketch follows this list.)
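
A minimal NumPy sketch of the idea, masking weights instead of activations (the function name and the inverted rescaling are my own choices, mirroring how inverted dropout is usually implemented):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropconnect_linear(x, W, b, drop_prob=0.2, training=True):
    """Linear layer with DropConnect: randomly zero weights (not activations) during training."""
    if training:
        mask = rng.random(W.shape) >= drop_prob
        W = W * mask / (1.0 - drop_prob)   # rescale so the expected pre-activation is unchanged
    return x @ W + b

# Example: batch of 4 inputs, 8 features in, 3 units out
out = dropconnect_linear(np.random.randn(4, 8), np.random.randn(8, 3), np.zeros(3))
```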

Similarities:

● Both techniques address overfitting by introducing randomness during training.
● Both encourage the development of robust features that are not dependent on specific neurons or connections.
Differences:

● Target: Dropout targets neurons, while DropConnect targets weights (connections).
● Sparsity: Dropout encourages sparsity in the activations of the network, while DropConnect encourages sparsity in the weights themselves.

Advantages of DropConnect:

● Potentially More Efficient: DropConnect might be computationally more efficient than Dropout in some cases, especially for recurrent neural networks (RNNs).
● Improved Regularization: Some studies suggest that DropConnect might offer
stronger regularization compared to Dropout, leading to better performance on
unseen data.

Disadvantages of DropConnect:

● Less Common: DropConnect is not as widely used as Dropout, and there might be
fewer resources and implementations readily available.
● Potential Tuning Challenges: The hyperparameters for DropConnect might be
more sensitive to tuning compared to Dropout.

Choosing Between Dropout and DropConnect:

● Dropout is a well-established and well-understood technique. It's often the default choice due to its simplicity and effectiveness.
● DropConnect can be a good alternative to explore if you're looking for potentially
stronger regularization or computational efficiency benefits, especially for RNNs.
● Experimentation is key. The best choice depends on your specific problem, dataset,
and network architecture.

Overall:

DropConnect provides an interesting alternative to Dropout for regularizing neural networks. While it is not as widely used, it's worth considering for specific scenarios where stronger regularization or improved efficiency could be beneficial. Both techniques share the core principle of introducing randomness during training to prevent overfitting and promote robust feature learning.
3. Batch normalization

Batch normalization (BatchNorm) is a powerful technique used in neural networks, but it doesn't directly fall under the category of regularization methods like Dropout or L1/L2 regularization. However, it plays a crucial role in improving the training process and can indirectly contribute to better generalization, which is a key goal of regularization.

Here's how BatchNorm works and how it relates to regularization:

Core Function of BatchNorm:

● Normalizes Activations: BatchNorm normalizes the activations of neurons within a layer during training. It standardizes each layer's outputs (typically the pre-activation values) to have zero mean and unit standard deviation across a mini-batch of training data (see the sketch below).
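
A minimal NumPy sketch of the normalization step described above (training-time batch statistics only; a full implementation would also keep running averages for use at inference, and the names here are my own):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then apply a learnable scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Example: mini-batch of 32 samples with 16 features
x = np.random.randn(32, 16) * 5.0 + 3.0
out = batch_norm_forward(x, gamma=np.ones(16), beta=np.zeros(16))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ~0 and ~1 per feature
```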

Indirect Regularization Effect:

While not a direct regularization technique, BatchNorm can indirectly contribute to better generalization by:

● Reducing Sensitivity to Initialization: Well-normalized activations make the network less sensitive to the initial weight values, allowing for a wider range of reasonable initializations and potentially leading to better generalization.
● Implicit Regularization: The normalization process itself might introduce
some implicit regularization effect by encouraging the network to learn
more robust features. However, this is an ongoing area of research.

Overall:
BatchNorm is a valuable technique for improving the training process of deep
neural networks. While not strictly a regularization method, its ability to normalize
activations, stabilize training, and allow for faster convergence with higher
learning rates can lead to models that generalize better on unseen data, which
aligns with the goals of regularization.

In conclusion, BatchNorm and regularization methods like Dropout and L1/L2 serve different purposes but ultimately work together to improve the training process and generalization capabilities of deep neural networks.

Advantages:

1. Faster Training Convergence: Stabilizes training, allowing higher learning rates and faster convergence.
2. Reduces Internal Covariate Shift: Keeps activations centered, leading to stable gradients and smoother optimization.
3. Improved Gradients: Ensures activations are on a similar scale, enhancing gradient flow and weight updates.
4. Less Sensitivity to Initialization: Makes training less sensitive to initial weight values, enabling robust training.
5. Implicit Regularization: May introduce regularization, helping learn robust features and reducing overfitting.

Disadvantages:

1. Increased Training Time: The additional computation for normalization slightly increases iteration time.
2. Reliance on Batch Size: Effectiveness depends on batch size; larger batches are preferred but increase memory usage.
3. Not a Silver Bullet: Not always beneficial; other techniques might be needed.
4. Concerns in Generative Models: May hinder learning the full data distribution, affecting output diversity.
