ADL Unit-3
IMPROVED OPTIMIZATION
1. Adagrad:
Core Idea: Unlike standard optimizers with a constant learning rate for
all parameters, Adagrad adapts the learning rate individually for each
parameter based on its historical gradients.
How it Works:
η is the base learning rate, which is modified for a given parameter θ(i) at time step t based on the previous gradients calculated for that parameter.
We store G(i,t), the sum of the squares of the gradients w.r.t. θ(i) up to time step t, and scale the update as
θ(i) = θ(i) − η/√(G(i,t) + ϵ) · g(i,t),
where ϵ is a smoothing term that avoids division by zero (usually on the order of 1e−8) and g(i,t) is the current gradient. Interestingly, without the square-root operation, the algorithm performs much worse.
Adagrad therefore makes big updates for infrequently updated parameters and small updates for frequently updated parameters.
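To make the update above concrete, here is a minimal NumPy sketch of one Adagrad step; the function name, learning rate, and toy gradients are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def adagrad_update(theta, grad, grad_sq_sum, lr=0.01, eps=1e-8):
    """One Adagrad step: accumulate squared gradients per parameter and
    scale each parameter's step by 1/sqrt(accumulated sum)."""
    grad_sq_sum += grad ** 2                           # G(i,t): running sum of g^2
    theta -= lr * grad / (np.sqrt(grad_sq_sum) + eps)  # per-parameter learning rate
    return theta, grad_sq_sum

# Toy usage: the second parameter receives gradients rarely, so it keeps a larger step.
theta = np.zeros(2)
grad_sq_sum = np.zeros(2)
for grad in [np.array([1.0, 0.0]), np.array([1.0, 0.5]), np.array([1.0, 0.0])]:
    theta, grad_sq_sum = adagrad_update(theta, grad, grad_sq_sum)
print(theta)
```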
Advantages:
1. The learning rate is adapted per parameter, so it requires little manual tuning and works well with sparse data and infrequent features.
Disadvantages:
1. The accumulated sum of squared gradients only grows, so the effective learning rate keeps shrinking and training can eventually stall (see Adadelta below).
Overall:
Adagrad is a solid choice for sparse problems, but its diminishing learning rate motivates Adadelta and RMSProp.
2. Adadelta:
Adagrad adapts the learning rate for each parameter based on the history
of its squared gradients. This is great for sparse data, but it can lead to a
problem called diminishing learning rates. As training progresses,
the sum of squared gradients keeps growing, causing the learning rates
to constantly decrease. Eventually, the learning rates become too small,
hindering further improvement and stalling the training process.
Instead of Adagrad's growing sum, Adadelta uses an exponentially decaying moving average of the squared gradients:
E[g²](t) = γ·E[g²](t−1) + (1−γ)·g²(t)
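As a small sketch of how this running average behaves compared with Adagrad's ever-growing sum (the decay value γ = 0.9 and the toy gradients are assumptions for illustration):

```python
def running_avg_sq(avg_sq, grad, gamma=0.9):
    """E[g^2](t) = gamma * E[g^2](t-1) + (1 - gamma) * g^2(t)"""
    return gamma * avg_sq + (1.0 - gamma) * grad ** 2

avg_sq, sum_sq = 0.0, 0.0
for grad in [1.0, 0.5, 0.5, 0.1]:
    avg_sq = running_avg_sq(avg_sq, grad)
    sum_sq += grad ** 2   # Adagrad-style sum keeps growing
print(avg_sq, sum_sq)      # the average stays bounded; the sum only increases
```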
Advantages:
1. The effective learning rate no longer decays toward zero, so training does not stall.
Disadvantages:
1. Computationally expensive.
3. RMSProp:
RMSProp (Root Mean Square Propagation) is an optimization algorithm used for training neural networks. It addresses a challenge faced by Stochastic Gradient Descent (SGD) and aims to accelerate learning.
SGD can struggle when the gradients for certain weights keep oscillating or shrinking, which hinders learning progress in those directions. RMSProp tackles this issue by adapting the learning rate for each weight based on an exponentially decaying average of its recent squared gradients.
RMSProp and Adadelta were developed independently around the same time, both stemming from the need to resolve Adagrad's radically diminishing learning rates.
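A minimal sketch of one RMSProp step following the description above; the learning rate, decay factor, and toy values are assumed defaults rather than anything prescribed in these notes.

```python
import numpy as np

def rmsprop_update(theta, grad, avg_sq, lr=0.001, decay=0.9, eps=1e-8):
    """Keep a decaying average of squared gradients and divide each
    weight's step by the root mean square of its recent gradients."""
    avg_sq = decay * avg_sq + (1.0 - decay) * grad ** 2
    theta -= lr * grad / (np.sqrt(avg_sq) + eps)
    return theta, avg_sq

theta, avg_sq = np.ones(2), np.zeros(2)
theta, avg_sq = rmsprop_update(theta, np.array([0.3, -0.1]), avg_sq)
print(theta)
```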
Advantages:
1. Faster convergence than plain SGD.
Disadvantages:
Overall, RMSProp is a powerful optimizer that can improve the efficiency and
stability of training compared to SGD. It's a good choice for various neural
network architectures, particularly when dealing with oscillating or noisy
gradients.
4. Adam:
Adam (Adaptive Moment Estimation) is a popular optimization algorithm for training neural
networks, combining the strengths of RMSProp and momentum to deliver efficient and
stable learning.
Core Idea:
Adam computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients v(t), like Adadelta and RMSProp, Adam also keeps an exponentially decaying average of past gradients m(t), similar to momentum. Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface.
m(t) and v(t) are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, hence the name of the method.
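A minimal sketch of one Adam step with bias correction, matching the description of m(t) and v(t) above; β1 = 0.9, β2 = 0.999, and the toy gradient are commonly used defaults, assumed here only for illustration.

```python
import numpy as np

def adam_update(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: decaying averages of gradients (m) and squared
    gradients (v), with bias correction for early time steps."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)              # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.ones(2), np.zeros(2), np.zeros(2)
for t in range(1, 4):
    grad = theta                  # hypothetical gradient of 0.5 * ||theta||^2
    theta, m, v = adam_update(theta, grad, m, v, t)
print(theta)
```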
Advantages:
● Speedy Convergence: Adam combines adaptive learning rates and
momentum, often leading to much faster training compared to vanilla SGD.
● Jack-of-All-Trades: It performs well across various neural network tasks,
making it a versatile choice for many applications.
● Less Tuning Hassle: Compared to some optimizers, Adam requires fewer
hyperparameters to fine-tune, simplifying the setup process.
Disadvantages:
5. NAG (Nesterov Accelerated Gradient):
The idea of the NAG algorithm is very similar to SGD with momentum, with a slight variant. In SGD with momentum, the gradient is computed at the current weights and then combined with the accumulated momentum from previous updates.
Momentum can be a good method, but if the momentum is too high the algorithm may overshoot the minimum and keep moving past it. The NAG algorithm was developed to resolve this issue. It is a look-ahead method: since we know we will be using γ·V(t−1) to modify the weights, θ − γ·V(t−1) approximately tells us the future location of the parameters. We therefore calculate the gradient of the cost at this future position rather than at the current one.
V(t) = γ·V(t−1) + α·∂J(θ − γ·V(t−1))/∂θ
and then update the parameters using θ = θ − V(t).
Again, we set the momentum term γ to a value of around 0.9. While momentum first computes the current gradient and then takes a big jump in the direction of the updated accumulated gradient, NAG first makes a big jump in the direction of the previously accumulated gradient, measures the gradient at that point, and then makes a correction; together these form the complete NAG update. This anticipatory update prevents us from going too fast and results in increased responsiveness, which has significantly improved the performance of RNNs on a number of tasks.
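A minimal sketch of the NAG update rule above on a toy one-dimensional objective; the objective J(θ) = 0.5·θ² and all constants are illustrative assumptions.

```python
def grad_J(theta):
    # gradient of the hypothetical objective J(theta) = 0.5 * theta^2
    return theta

theta, v = 5.0, 0.0
gamma, alpha = 0.9, 0.1
for _ in range(200):
    lookahead = theta - gamma * v              # approximate future position
    v = gamma * v + alpha * grad_J(lookahead)  # V(t) = gamma*V(t-1) + alpha*dJ(theta - gamma*V(t-1))/dtheta
    theta = theta - v                          # theta = theta - V(t)
print(theta)  # moves toward the minimum at 0
```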
Advantages:
1. Faster convergence.
2. Handles uneven loss landscapes well.
Disadvantages:
1. May not always be the best choice.
2. Has some overlap with Adam's functionality.
Second-Order Methods
Second-order methods use curvature information (the Hessian, the matrix of second derivatives of the loss) in addition to gradients. Here are some specific types of second-order methods used in deep learning:
● Newton's Method: The most well-known second-order method, offering the fastest
theoretical convergence but also the most computationally expensive due to the
exact Hessian calculation.
● Quasi-Newton Methods: These methods approximate the Hessian using
information from past gradients, making them more scalable than Newton's Method
but potentially less accurate.
● Hessian-Free Methods: These avoid calculating the Hessian entirely, using
gradients and function values for updates. They offer better scalability but might have
slower convergence.
Overall, second-order methods are still an area of active research in deep learning.
While not always practical for large models due to computational limitations, they
offer valuable insights into the optimization landscape and can be useful for smaller
models or specific tasks where faster convergence is a priority.
1. Newton's Method:
Here's how it works for finding the root (x-intercept) of a function f(x), which means solving the equation f(x) = 0:
1. Initial Guess: You start with an initial guess (x₀) of where the root might be. This can
be based on your knowledge of the function or pure intuition.
2. Calculate Slope and Line: Imagine the graph of the function f(x). At your initial
guess (x₀), Newton's method calculates the slope of the tangent line to the function at
that point. This slope is obtained using the first derivative of f(x), evaluated at x₀
(f'(x₀)).
3. Intersection with X-axis: The tangent line is then extended until it intersects the
x-axis. This intersection point (x₁) represents a better estimate of the root compared
to your initial guess.
4. Repeat: You use the newly found x₁ as your next guess and repeat the process
(calculate slope, extend tangent line, find new intersection). The idea is that with
each iteration, you get closer and closer to the actual root.
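A short sketch of these four steps on a concrete example, f(x) = x² − 2; the function, tolerance, and starting guess are chosen purely for illustration.

```python
def newton_root(f, f_prime, x0, tol=1e-10, max_iter=50):
    """Newton's method: repeatedly follow the tangent line down to the x-axis."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        x = x - fx / f_prime(x)   # where the tangent at x crosses the x-axis
    return x

# Root of f(x) = x^2 - 2 (i.e. sqrt(2) ~ 1.4142), starting from x0 = 1.
print(newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))
```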
Advantages:
● Fast Convergence: In theory, Newton's method can converge to the root very quickly, especially if your initial guess is close to the actual solution.
In summary:
While Newton's method offers the potential for faster convergence, its computational cost
and sensitivity to noise make it less favorable for training large deep learning models.
However, it remains a valuable concept in understanding optimization techniques and can be
useful for smaller models or specific tasks where faster convergence is crucial.
2. Quasi-Newton Methods:
● Goal: Like Newton's method, they aim to find the minimum of a function (often the loss function in deep learning).
● Approach: Instead of calculating the exact Hessian at every step, they approximate
it using information from past gradients. This makes them more scalable for larger
models compared to Newton's method.
How it Works:
1. Similar to Gradient Descent: They start with an initial guess for the parameters and iteratively update them in the direction that minimizes the loss function.
2. Leveraging Past Information: Unlike SGD (Stochastic Gradient Descent) which
only uses the current gradient, Quasi-Newton methods incorporate information from
past gradients to build an approximation of the Hessian.
3. Updating the Approximation: As the optimization progresses, the approximation of
the Hessian is refined based on new gradient information.
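As a hedged illustration of what this looks like in practice, the sketch below minimizes a toy quadratic with SciPy's BFGS implementation; the objective and starting point are assumptions made only for this example.

```python
import numpy as np
from scipy.optimize import minimize

# Toy loss: a quadratic bowl with its minimum at (1, 2).
def loss(theta):
    return (theta[0] - 1.0) ** 2 + (theta[1] - 2.0) ** 2

# BFGS builds up an approximation of the Hessian from past gradients.
result = minimize(loss, x0=np.zeros(2), method="BFGS")
print(result.x)  # close to [1.0, 2.0]
```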
Popular Quasi-Newton Methods:
● BFGS (Broyden–Fletcher–Goldfarb–Shanno) and its limited-memory variant L-BFGS are the most widely used.
Advantages:
● Faster Convergence: Compared to SGD, they can converge to the minimum of the loss function more quickly, especially for well-behaved loss functions.
● More Scalable: Their use of approximated Hessians makes them more suitable for training larger deep learning models than Newton's method.
Disadvantages:
● Still Computationally Expensive: While more efficient than Newton's method, they are still more computationally demanding than SGD.
● Approximation Accuracy: The accuracy of the Hessian approximation affects the convergence rate. For complex loss functions, the approximation might not be very accurate, hindering performance.
3. Hessian-Free Methods:
Core Idea:
Hessian-free methods bypass the explicit calculation of the Hessian entirely. Instead,
they utilize information readily available during training, such as gradients and
function values, to update the model's parameters and guide them towards the
minimum of the loss function.
How it Works:
1. Similar Starting Point: These methods often begin with an initial guess for the
model parameters, similar to other optimization algorithms.
2. Leveraging Gradients: They utilize the gradients of the loss function with
respect to the parameters. The gradient indicates the direction of steepest
descent, providing valuable information about how to update the parameters to
minimize the loss.
3. Additional Information: Some Hessian-free methods might also incorporate
additional information beyond gradients, such as curvature information
obtained through techniques like finite differences, to improve the update
direction.
4. Iterative Updates: The parameters are then updated iteratively based on the
calculated direction and a chosen step size. As the optimization progresses,
the updates become more refined, gradually moving towards the minimum.
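A minimal sketch of the kind of curvature information step 3 alludes to: a Hessian-vector product approximated with a finite difference of gradients, so the full Hessian is never formed. The toy gradient function and step size are assumptions for illustration only.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def toy_grad(theta):
    # gradient of the hypothetical quadratic loss 0.5 * theta^T A theta
    return A @ theta

def hessian_vector_product(grad_fn, theta, v, eps=1e-5):
    """Approximate H @ v as (grad(theta + eps*v) - grad(theta)) / eps,
    using only gradient evaluations."""
    return (grad_fn(theta + eps * v) - grad_fn(theta)) / eps

theta = np.array([1.0, -1.0])
v = np.array([1.0, 0.0])
print(hessian_vector_product(toy_grad, theta, v))  # approx. first column of A: [3, 1]
```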
Disadvantages:
● Convergence Rate: While often faster than SGD, their convergence rate might be slower than exact second-order methods due to the lack of a complete Hessian picture.
● Hyperparameter Tuning: These methods might require careful tuning of
hyperparameters to achieve optimal performance.
Overall, Hessian-free methods offer a promising approach for training deep learning
models. They combine the efficiency of first-order methods (like SGD) with some of
the convergence benefits of second-order methods, making them a valuable tool for
various deep learning tasks.
Saddle Points
https://www.youtube.com/watch?v=ktxztPzQg6o&ab_channel=LearningMonkey
(The relevant equations are worked through in the video.)
A saddle point is a point where the gradient is zero but which is neither a minimum nor a maximum: the loss surface curves upward along some directions and downward along others. Saddle points cause two main problems for the optimizers above:
● Slows Down Training: If the optimization algorithm gets stuck in a saddle point,
it can significantly slow down the training process, hindering the model's
ability to learn effectively.
● Suboptimal Performance: Even if the algorithm escapes a saddle point, it might
not reach the true minimum point, leading to suboptimal performance for the
neural network.
Regularization Methods
Overfitting occurs when a neural network model learns the training data too well, including the noise and irrelevant details. This leads to poor performance on unseen data, the ultimate test of a model's effectiveness. Regularization methods are techniques for preventing overfitting.
1. DROPOUT
Dropout randomly deactivates ("drops") a fraction of the neurons at each training step, so the network cannot become dependent on any single neuron. Two important consequences:
1. New Network Every Iteration: Since different neurons are dropped out at each training step, the network effectively encounters a new architecture during each iteration. This helps prevent the model from becoming overly reliant on specific features or connections.
2. No Need for Specific Neuron Selection: Unlike L1 regularization (which encourages sparsity by driving some weights to zero), dropout doesn't explicitly select which neurons are important. The random deactivation forces the network to learn robust representations across different subsets of neurons.
Implementation:
Dropout is typically implemented during training only, not during inference (using
the model for prediction). Libraries like TensorFlow and PyTorch offer dropout
layers that can be easily integrated into your neural network architecture.
The optimal dropout rate (the percentage of neurons to drop) can vary depending
on the dataset and network architecture. Experimentation is often necessary to
find the best value for your specific model.
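A minimal PyTorch sketch of adding a dropout layer to a small network; the layer sizes and the dropout rate of 0.5 are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes 50% of activations during training
    nn.Linear(64, 10),
)

model.train()                         # dropout active while training
out_train = model(torch.randn(8, 100))
model.eval()                          # dropout disabled at inference time
out_eval = model(torch.randn(8, 100))
```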
2. DROP CONNECT
DropConnect is a regularization technique for neural networks that builds upon the concept
of Dropout, offering an alternative approach to preventing overfitting. Here's how it compares
to Dropout:
Similarities:
● Like Dropout, DropConnect randomly removes parts of the network during training, which discourages co-adaptation and helps prevent overfitting.
Advantages of DropConnect:
● Rather than dropping entire neurons (activations), DropConnect drops individual weights (connections), giving a finer-grained form of regularization.
Disadvantages of DropConnect:
● Less Common: DropConnect is not as widely used as Dropout, and there might be
fewer resources and implementations readily available.
● Potential Tuning Challenges: The hyperparameters for DropConnect might be
more sensitive to tuning compared to Dropout.
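DropConnect is not a standard built-in layer in most frameworks, so the following is only a hedged sketch of the idea: mask individual weights (rather than whole activations) with a random Bernoulli mask during training. The class name, sizes, and drop probability are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropConnectLinear(nn.Linear):
    """Linear layer that randomly drops individual weights while training."""
    def __init__(self, in_features, out_features, drop_prob=0.5):
        super().__init__(in_features, out_features)
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.training:
            keep = 1.0 - self.drop_prob
            mask = torch.bernoulli(torch.full_like(self.weight, keep))
            # scale by 1/keep so the expected activation stays roughly unchanged
            return F.linear(x, self.weight * mask / keep, self.bias)
        return F.linear(x, self.weight, self.bias)

layer = DropConnectLinear(100, 10, drop_prob=0.3)
out = layer(torch.randn(8, 100))
```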
3. BATCH NORMALIZATION
Overall:
BatchNorm is a valuable technique for improving the training process of deep
neural networks. While not strictly a regularization method, its ability to normalize
activations, stabilize training, and allow for faster convergence with higher
learning rates can lead to models that generalize better on unseen data, which
aligns with the goals of regularization.
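A minimal PyTorch sketch of placing a batch-normalization layer between layers; the sizes and batch size are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 64),
    nn.BatchNorm1d(64),   # normalizes each of the 64 features over the batch
    nn.ReLU(),
    nn.Linear(64, 10),
)

out = model(torch.randn(32, 100))  # BatchNorm1d needs a batch size > 1 in training mode
```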
Advantages:
Disadvantages: