
Deep Learning Module 3: Optimization for Training Deep Models

Module 3: Optimization for Training Deep Models: Empirical Risk Minimization, Challenges in Neural
Network Optimization, Basic Algorithms: Stochastic Gradient Descent, Algorithms with Adaptive Learning
Rates.

Text Book: Ian Goodfellow, Yoshua Bengio, Aaron Courville, “Deep Learning”, MIT Press, 2016.

(Chapters 8.1-8.5)

Optimization for Training Deep Models

• A deep learning model consists of multiple layers of interconnected neurons. Each neuron computes an
activation function on the incoming data and passes the result to the next layer. The activation functions
introduce non-linearity, allowing complex mappings between inputs and outputs.
• The goal of training a deep learning model is to bring the model's output as close as possible to the actual
output for a given collection of input-output pairs. The modelling task is complete once the model's output
closely matches the actual output.
• To assess how closely model predictions align with actual outputs, a mathematical function called a
loss function is used. This function indicates how well the model is performing at each stage of
training.
• The optimization algorithm iteratively updates the model's parameters to reduce the loss on the training
data; its focus is on finding the parameters θ of a neural network that significantly minimize a cost function
J(θ). This cost function not only measures performance across the entire training set but often includes
additional regularization terms to improve generalization.

How Learning Differs from Pure Optimization

1. Purpose:
• Learning aims to generalize from data or experience, creating models that can identify patterns,
predict results, or make decisions based on past examples or knowledge. Through learning, the
model adapts to new, previously unseen inputs.
Example: A machine learning model is trained on a dataset of images to recognize cats, and once
trained, it can recognize new cat images that were not in the training set.

• Pure Optimization aims to find the optimal solution to a clearly defined problem, typically by
minimizing or maximizing an objective function. Unlike learning, it doesn't involve generalization;
the goal is solely to solve the specific problem as efficiently as possible.

Example: A linear programming solver finds the minimum cost for transporting goods across a
network given specific supply and demand constraints.
2. Adaptability:
• Learning involves adapting to new data and adjusting the model to handle previously unseen
situations.
Example: A reinforcement learning agent improves its gameplay over time by refining its strategies
based on rewards and penalties, learning from each experience.
• Pure Optimization addresses a static problem within a fixed set of constraints. Once the optimal
solution is identified, it remains unchanged unless the problem itself is redefined.
Example: Solving the traveling salesman problem for a specific set of cities. The solution is tailored
to those cities and does not adapt to a new set without re-running the optimization process.
3. Iteration and Feedback:
• Learning: Involves continuous iterations, where feedback—such as errors or rewards—guides the
model to improve over time. This is common in machine learning and reinforcement learning.
• Pure Optimization: Involves iterative steps to identify the best solution, but the process concludes
once the solution is found. There is no ongoing feedback or further improvement unless the
problem is re-solved.
4. Data Dependency:
• Learning: Heavily relies on data to improve its performance and adapt to changes. Data drives
learning processes.
• Pure Optimization: Works based on a fixed set of parameters, constraints, and an objective
function. It doesn’t rely on data for learning but instead for solving a static problem.
5. Output:
• Learning: The output is a model or function that generalizes across various inputs and can make
predictions for new cases.
• Pure Optimization: The output is the best (optimal) solution for the specific problem defined by
the given inputs and constraints.
6. Uncertainty Handling:
• Learning: Often deals with uncertainty and probabilistic outcomes, such as in predictive models.
• Pure Optimization: Typically works with deterministic inputs, though there are stochastic
optimization methods to handle uncertainty.


Empirical Risk Minimization

• Empirical Risk Minimization (ERM) is a core principle in machine learning and statistical learning theory.
It involves minimizing the "empirical risk," or observed loss, calculated from a given dataset.
• This contrasts with minimizing the "true" or expected risk, which depends on the unknown underlying
data distribution.
• Empirical Risk refers to the average loss calculated over a given dataset. It represents how well a model
performs on the observed data by summing up the errors made by the model for each example in the
dataset.
• Empirical risk serves as an estimate of the model's performance during training but may not fully reflect
its ability to generalize to unseen data.
• It is defined as the average loss over all examples in the dataset, where loss quantifies the error between
the model's predictions and the actual target values.
• The goal of many learning algorithms is to minimize this empirical risk during training.
• Risk and Expected Generalization Error:
• The true goal of machine learning is to minimize the expected generalization error, also known as the risk.
• Risk involves an expectation over the true data distribution p_data(x, y), which is often unknown.
• Challenge of Unknown True Distribution:
• If we had access to p_data(x, y), risk minimization would be a straightforward optimization task.
• In practice, we only have a training dataset sampled from p_data(x, y), making it a machine learning
problem.
• Empirical Risk Minimization (ERM):
• To solve this, ERM approximates the true risk by minimizing the empirical risk, which uses the empirical
distribution p̂_data(x, y) derived from the training set.
• The empirical risk is defined as:

E_{(x,y) ~ p̂_data} [L(f(x; θ), y)] = (1/m) Σ_{i=1..m} L(f(x^(i); θ), y^(i))

• where L(f(x;θ), y) is the loss function, m is the number of training samples, f(x;θ) is the model, and θ
represents the model parameters.
• Optimization Process: ERM reformulates the problem into minimizing the average training error. It
involves replacing p_data(x, y) with p̂_data(x, y), making it an optimization problem.
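A minimal Python sketch of computing the empirical risk follows; the name empirical_risk, the linear model, and the squared-error loss are illustrative assumptions, not from the source:

import numpy as np

def empirical_risk(model, theta, X, Y, loss):
    # Average loss of the model over the m training examples.
    m = len(X)
    return sum(loss(model(x, theta), y) for x, y in zip(X, Y)) / m

# Illustrative choices: a linear model with squared-error loss.
model = lambda x, theta: float(np.dot(theta, x))
loss = lambda y_hat, y: (y_hat - y) ** 2

X = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
Y = [1.0, 2.0]
theta = np.zeros(2)
print(empirical_risk(model, theta, X, Y, loss))  # 2.5 for these toy values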


• Drawbacks of ERM:
• Overfitting: High-capacity models (like deep networks) can memorize the training data instead of
generalizing to unseen examples.
• Infeasibility for Some Loss Functions: For example, the 0-1 loss has no useful derivatives for
gradient descent.
• Modern Optimization in Deep Learning:
• Modern approaches often avoid pure ERM because of overfitting and practical limitations.
• Instead, methods like regularization, data augmentation, and alternative loss functions (e.g., cross-
entropy) are used to strike a balance between empirical risk and generalization.

Surrogate Loss Functions and Early Stopping

• Surrogate loss functions and early stopping are important techniques used in machine learning, particularly
when training models such as neural networks.
• A surrogate loss function is a differentiable approximation of a target loss function that might be difficult
to optimize directly. The purpose of using surrogate loss functions is to simplify the optimization process
while still producing good results for the original task.
• Why Use Surrogate Loss Functions?
• Non-differentiable Target Loss: Some loss functions, such as the 0-1 loss in classification (which
simply counts whether predictions are correct or not), are non-differentiable.
• This makes them unsuitable for gradient-based methods like stochastic gradient descent (SGD).
Surrogate loss functions provide a smooth and differentiable alternative.
• Computational Efficiency: Surrogate loss functions are often more computationally efficient to
optimize compared to the true loss function, making them practical for large-scale machine learning
problems.
• Generalization: Although surrogate loss functions do not directly minimize the desired objective (e.g.,
classification error), they often provide good generalization to unseen data when minimized.
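For instance, the 0-1 loss and a smooth logistic surrogate for a binary classifier with labels in {−1, +1} can be sketched as follows (the helper names here are illustrative):

import numpy as np

# 0-1 loss: piecewise constant, so its gradient is zero almost everywhere
# and gives gradient-based optimizers nothing to work with.
def zero_one_loss(score, y):            # y in {-1, +1}
    return float(np.sign(score) != y)

# Logistic loss: a smooth, differentiable surrogate with informative
# gradients everywhere.
def logistic_loss(score, y):
    return float(np.log1p(np.exp(-y * score)))

def logistic_grad(score, y):            # d(loss)/d(score)
    return float(-y / (1.0 + np.exp(y * score)))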
• Early stopping is a form of regularization used to prevent overfitting during the training process. The idea
is to stop training before the model starts to overfit the training data.
• Why Use Early Stopping?
• Overfitting Prevention: As training progresses, the model’s performance on the training data improves,
but at some point, it may begin to memorize the training examples, leading to poor performance on unseen
test data.


• Early stopping halts training at the point where the model generalizes best.
• Efficiency: Early stopping reduces computational costs by preventing unnecessary extra training steps that
do not lead to better generalization.
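A minimal sketch of an early-stopping loop; the callables train_one_epoch and validation_loss are illustrative assumptions:

def train_with_early_stopping(train_one_epoch, validation_loss, max_epochs=100, patience=5):
    # Stop when the validation loss has not improved for `patience` epochs,
    # and return the parameters that generalized best.
    best_loss, best_params, stale_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        params = train_one_epoch()
        val_loss = validation_loss(params)
        if val_loss < best_loss:
            best_loss, best_params, stale_epochs = val_loss, params, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break
    return best_params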

Challenges in Neural Network Optimization

• Optimizing neural networks can be challenging due to various factors that arise from the complexity of
the model, the high-dimensional parameter space, and the nature of the data.
1. Vanishing and Exploding Gradients:
• Vanishing Gradients: Gradients become very small as they backpropagate through deep networks,
especially with sigmoid or tanh activation functions. This slows learning in the early layers.
• Common Cause: Activation functions that shrink gradients across layers.
• Exploding Gradients: Gradients grow excessively large, destabilizing the learning process and causing
divergence.
• Common Cause: Poor weight initialization or unsuitable activation functions.
2. Saddle Points:
• A saddle point is a point on the surface of the graph where the tangent plane is horizontal but the point
is neither a local maximum nor a local minimum.
• Problem: They slow down or trap optimization algorithms temporarily, leading to very slow progress
in training.
• Figure 1 depicts the 3-dimensional structure of a local minimum, a local maximum and a saddle point.

Fig 1: Local minima, maxima and saddle point


3. Poorly Conditioned Loss Surfaces:

• A poorly conditioned loss surface occurs when the loss changes slowly in some directions and rapidly
in others. This disparity makes it difficult for optimizers to determine an appropriate step size.

• Effect: Leads to slow convergence or oscillations during training.

• Example: Loss surfaces shaped like a narrow valley, where gradients point in different directions.

4. Local Minima and Plateaus:


• Neural networks often face non-convex loss landscapes, which contain local minima and plateaus.
• These features can hinder or slow down convergence during training.
• Research suggests that in high-dimensional parameter spaces, many local minima are close to the
global minimum in terms of performance.
• Flat regions (plateaus), however, remain challenging as they slow down the training process
significantly.
5. Hyperparameter Tuning:
• Hyperparameter optimization is crucial for effective neural network training.
• Key hyperparameters include learning rate, batch size, momentum, weight decay, and regularization
coefficients.
• The high dimensionality of the hyperparameter space makes finding the optimal combination
challenging but necessary for good performance.
6. Overfitting:
• It is a common challenge in neural networks, where the model learns to perform exceptionally well on
the training data but fails to generalize to unseen data.
• This occurs when the model becomes too complex, capturing noise and specific patterns in the training
set that do not represent the broader dataset.
• Overfitting can be mitigated using techniques such as regularization, dropout, and early stopping, as
well as by ensuring a sufficient amount of diverse training data.
• Proper tuning of model complexity and hyperparameters is also essential to strike a balance between
underfitting and overfitting, enabling the model to generalize effectively.
7. Choice of Optimizer:
• Determines how the model updates its parameters during training.
• Common optimizers include SGD, Adam, and RMSprop, each suited to different scenarios.
• The choice depends on factors like problem complexity, dataset size, and desired convergence speed.


8. Learning Rate Selection:

• A high learning rate can cause divergence or unstable training.

• A low learning rate results in slow convergence or suboptimal solutions.

• Techniques such as learning rate schedules (e.g., step decay, cosine annealing) or adaptive learning
rates (e.g., Adam, AdaGrad) can help balance convergence and stability.

9. Data Imbalance:

• Occurs when some classes in the dataset are overrepresented while others are underrepresented.

• Leads to biased models that favour majority classes.

• Can be addressed using strategies like:

▪ Oversampling minority classes or undersampling majority classes.


▪ Using class-weighted loss functions to penalize misclassification of minority classes.
▪ Generating synthetic data using methods like SMOTE (Synthetic Minority Oversampling
Technique).
10. Generalization and Transfer Learning:

• Generalization refers to a model's ability to perform well on unseen data, indicating that it has learned
patterns that apply beyond the training dataset.

• Achieving good generalization requires balancing model complexity, avoiding overfitting, and using
techniques like regularization, dropout, and sufficient training data.

• Transfer learning involves leveraging pre-trained models on large datasets to solve related tasks with
limited data.

o It reduces training time and often improves performance, especially in cases with small
datasets.

o Fine-tuning or freezing layers in the pre-trained model can be done based on the new task's
similarity to the original one.

11. Model Interpretability: Model interpretability focuses on understanding how a model makes its
predictions, crucial for trust and debugging. Complex models like deep neural networks are often
considered "black boxes," making interpretability challenging.


Basic Algorithms: Stochastic Gradient Descent

➢ Gradient descent: Gradient descent is the simplest optimization algorithm. It computes the gradient of the
loss function with respect to the model weights and updates them using the following formula:

w_t = w_{t−1} − ε ∗ dw_t

where w is the weight vector, dw_t is the gradient of the loss with respect to w at iteration t, ε is the learning
rate, and t is the iteration number.

▪ To understand why gradient descent converges slowly, let us look at the example below of
a ravine where a given function of two variables should be minimised. A ravine is an area where the
surface is much steeper in one dimension than in another.

Fig 2: Example of an optimization problem with gradient descent in a ravine area.

▪ The starting point in Figure 2 is depicted in blue and the local minimum is shown in black.
▪ From the image, we can see that the starting point and the local minimum have different horizontal
coordinates but almost equal vertical coordinates.
▪ Using gradient descent to find the local minimum will likely make the loss slowly oscillate along the
vertical axis.
▪ These bounces occur because gradient descent does not store any history of its previous gradients,
making the gradient steps more indeterministic on each iteration.
▪ This example can be generalized to a higher number of dimensions.
▪ As a consequence, it would be risky to use a large learning rate, as it could lead to divergence.
➢ Stochastic Gradient Descent (SGD): Stochastic gradient descent (SGD) and its variants are probably the
most used optimization algorithms for machine learning in general and for deep learning in particular.


▪ It addresses the computational inefficiency of traditional Gradient Descent methods when dealing with
large datasets in machine learning projects.
▪ In SGD, instead of using the entire dataset for each iteration, only a single random training example
(or a small batch) is selected to calculate the gradient and update the model parameters.
▪ This random selection introduces randomness into the optimization process, hence the term
“stochastic” in Stochastic Gradient Descent.
▪ The advantage of using SGD is its computational efficiency, especially when dealing with large
datasets.
▪ By using a single example or a small batch, the computational cost per iteration is significantly reduced
compared to traditional Gradient Descent methods that require processing the entire dataset.

Algorithm 8.1 shows the stochastic gradient descent algorithm.

▪ Initialization: Randomly initialize the parameters of the model.

▪ Set Parameters: Determine the number of iterations and the learning rate (ε) for updating the
parameters.

▪ Stochastic Gradient Descent Loop: Repeat the following steps until the model converges or reaches
the maximum number of iterations:

▪ Shuffle the training dataset to introduce randomness.

▪ Iterate over each training example (or a small batch) in the shuffled order.

▪ Compute the gradient of the cost function with respect to the model parameters using the current training
example (or batch).

▪ Update the model parameters by taking a step in the direction of the negative gradient, scaled by the learning
rate.

▪ Evaluate the convergence criteria, such as the change in the cost function between successive iterations.

▪ Return Optimized Parameters: Once the convergence criteria are met or the maximum number of iterations is
reached, return the optimized model parameters.
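A minimal sketch of this loop; the grad(params, x, y) callable, returning the gradient of the loss on a single example, and the NumPy-array parameters are illustrative assumptions:

import random

def sgd(params, data, grad, lr=0.01, epochs=10):
    # One gradient step per (x, y) training example, in shuffled order.
    # `params` is assumed to be a NumPy array.
    data = list(data)
    for _ in range(epochs):
        random.shuffle(data)                # shuffle to introduce randomness
        for x, y in data:
            g = grad(params, x, y)          # gradient on a single example
            params = params - lr * g        # step against the gradient
    return params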

▪ In SGD, since only one sample from the dataset is chosen at random for each iteration, the path taken
by the algorithm to reach the minima is usually noisier than your typical Gradient Descent algorithm.
▪ But that does not matter much, as long as we reach the minimum in a significantly shorter training
time.
▪ The below figure 3 depicts the optimization path taken by the stochastic gradient descent algorithm.

Fig 3: Optimization path of stochastic gradient descent.

▪ One thing to be noted is that, as SGD is generally noisier than typical Gradient Descent, it usually takes
a higher number of iterations to reach the minima, because of the randomness in its descent.
▪ Even though it requires a higher number of iterations to reach the minima than typical Gradient
Descent, it is still computationally much less expensive than typical Gradient Descent.
▪ Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for optimizing a learning
algorithm.
➢ Momentum: The method of momentum (Polyak, 1964) is designed to accelerate learning, especially in
the face of high curvature, small but consistent gradients, or noisy gradients.
▪ The momentum algorithm accumulates an exponentially decaying moving average of past gradients
and continues to move in their direction.
▪ Based on the example above, it would be desirable for the optimizer to take larger steps in the
horizontal direction and smaller steps in the vertical one. This way, the convergence would be much
faster. This effect is exactly what Momentum achieves.


▪ Momentum uses a pair of equations at each iteration:

v_t = β ∗ v_{t−1} + (1 − β) ∗ dw_t
w_t = w_{t−1} − α ∗ v_t

▪ The first formula computes an exponentially moving average of the gradient values dw, with β
controlling how quickly past gradients decay.
▪ Basically, it is done to store trend information about a set of previous gradient values.
▪ The second equation performs the normal gradient descent update using the moving average computed
on the current iteration. α is the learning rate of the algorithm.
▪ Momentum can be particularly useful for cases like the above. Imagine we have computed gradients
on every iteration like in the picture above.
▪ Instead of simply using them for updating weights, we take several past values and perform update in
the averaged direction.

Fig 4. Optimization with momentum.

▪ In practice, Momentum usually converges much faster than gradient descent. With Momentum, there
are also fewer risks in using larger learning rates, thus accelerating the training process.

Algorithm 8.2 shows the stochastic gradient descent algorithm with momentum.
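A minimal sketch of SGD with momentum, built on the same illustrative grad callable as before:

import random

def sgd_momentum(params, data, grad, lr=0.01, beta=0.9, epochs=10):
    # The velocity v accumulates an exponentially decaying moving
    # average of past gradients; updates move in the averaged direction.
    v = 0.0
    data = list(data)
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            g = grad(params, x, y)
            v = beta * v + (1.0 - beta) * g    # moving average of gradients
            params = params - lr * v           # step in the averaged direction
    return params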


➢ Nesterov Momentum:
▪ Nesterov Momentum is a technique that can improve the convergence speed of stochastic gradient
descent, a popular optimization algorithm used to train machine learning models.
▪ It was introduced by Yurii Nesterov in 1983 and has since become a widely used technique in machine
learning.
▪ Stochastic gradient descent (SGD) is an optimization algorithm that iteratively updates the parameters
of a model to minimize the loss function.
▪ In each iteration, SGD computes the gradient of the loss function with respect to the model parameters
and uses this gradient to update the parameters in the opposite direction.
▪ This process is repeated until the loss function reaches a minimum or another stopping criterion is met.
▪ One issue with SGD is that it can oscillate and take a long time to converge to a minimum, especially
when the loss function has a complex structure or is highly non-convex.
▪ Nesterov Momentum is a technique that helps to mitigate this issue by adding a momentum term to
the update rule.
▪ The momentum term is essentially a weighted average of the past gradients, with the weighting
decreasing exponentially as the gradients get further in the past.
▪ This helps to smooth out the oscillations and accelerate convergence by allowing the optimizer to take
larger steps in the direction of the minimum.
▪ To implement Nesterov Momentum, we simply add a momentum term to the update rule for the model
parameters.
▪ For example, if we are using SGD to update the parameters w with a learning rate lr, the update rule
with Nesterov Momentum would be:

v = momentum * v - lr * grad(w + momentum * v)   # gradient at the look-ahead point

w = w + v

▪ Note that, unlike classical momentum, the gradient is evaluated at the look-ahead position
w + momentum * v rather than at w itself; this look-ahead is what distinguishes Nesterov Momentum.

▪ where v is the velocity (momentum) term and momentum is a hyperparameter that controls the strength
of the momentum.
▪ A value of momentum=0 corresponds to regular SGD, while larger values of momentum correspond
to more aggressive momentum.
▪ In practice, Nesterov Momentum can significantly improve the convergence speed of SGD and is often
used in combination with other techniques such as learning rate scheduling and mini-batch training.


Parameter Initialization Strategies:

▪ Parameter initialization in deep learning models refers to the process of setting initial values for the
weights and biases of the model's neurons or nodes before training.

▪ Proper parameter initialization is a critical aspect of training deep learning models.

▪ It can significantly impact the model's convergence, stability, and generalization performance, making
it an important consideration for building successful and well-performing AI systems.

▪ These weights and biases play a crucial role in how the model learns and generalizes from the data it
is trained on.

▪ The choice of parameter initialization can significantly impact the model's convergence speed, training
stability, and overall performance.

➢ Why parameter initialization matters:

▪ Convergence speed: Proper initialization can help the model converge to an optimal solution more
quickly.

▪ If the initial weights are too small or too large, it may lead to slow convergence, which means the
model will take longer to learn from the data.

▪ Avoiding vanishing or exploding gradients: During backpropagation, gradients are propagated
backward through the network to update the weights.

▪ If the initial weights are too small, it can cause the gradients to become extremely small (vanishing
gradients) as they propagate through each layer, leading to slow or stalled learning. On the other hand,
if the weights are too large, the gradients can become very large (exploding gradients), making the
learning process unstable.

▪ Training stability: Proper initialization can help stabilize the training process and make it less sensitive
to small changes in the data. This is especially important in deep neural networks, where the effects of
poor initialization can be amplified as information flows through multiple layers.

▪ Preventing biases: Biases are additional parameters in neural networks that help models fit the data
better. If biases are not initialized correctly, it can result in biased learning, leading to suboptimal or
skewed representations learned by the model.


▪ Generalization performance: The choice of initialization can also impact the model's ability to
generalize to unseen data. If the initialization is biased towards the training data, the model might
struggle to perform well on new, unseen examples.

▪ List of common parameter initialization techniques:

▪ Zero Initialization: Setting all weights and biases to zero. However, this is generally not recommended,
as it fails to break the symmetry between neurons and leads to slow or stalled convergence.

▪ Random Initialization: Initializing weights and biases with random values drawn from a uniform or
Gaussian distribution. This is one of the most common initialization methods.

▪ Xavier/Glorot Initialization: Proposed by Xavier Glorot and Yoshua Bengio, this method scales the
random initial weights in inverse proportion to the square root of the number of input and output
connections of each neuron. It works well for sigmoid and hyperbolic tangent activation functions.

▪ He Initialization: Proposed by Kaiming He et al., this method is similar to Xavier initialization but
scales the weights by the square root of two divided by the number of input connections. It is more
suitable for ReLU (Rectified Linear Unit) activation functions.

▪ Identity Initialization: Sets the weights of the hidden units to the identity matrix, and biases to zero.
This technique is often used in recurrent neural networks (RNNs).

▪ Normalized Initialization: Scaling the initial weights by the inverse of the square root of the number
of inputs to ensure an average unit norm in the network.

▪ Sparse Initialization: Setting a portion of the weights to zero randomly to encourage sparsity in the
network.
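A minimal sketch of the Xavier/Glorot and He schemes described above (the layer sizes are illustrative):

import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # Uniform weights scaled by fan-in and fan-out; suits sigmoid/tanh layers.
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def he_init(n_in, n_out):
    # Gaussian weights with variance 2 / fan-in; suits ReLU layers.
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

W1 = xavier_init(784, 256)   # e.g. a tanh hidden layer
W2 = he_init(256, 128)       # e.g. a ReLU hidden layer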

Algorithms with Adaptive Learning Rates:

▪ Neural network researchers have long realized that the learning rate was reliably one of the
hyperparameters that is the most difficult to set because it has a significant impact on model
performance. As discussed previously, the cost is often highly sensitive to some directions in parameter
space and insensitive to others.
▪ The momentum algorithm can mitigate these issues somewhat, but does so at the expense of
introducing another hyperparameter. In the face of this, it is natural to ask if there is another way. If
we believe that the directions of sensitivity are somewhat axis-aligned, it can make sense to use a
separate learning rate for each parameter, and automatically adapt these learning rates throughout the
course of learning.

AdaGrad:

▪ The Adaptive Gradient Algorithm (AdaGrad), introduced by Duchi et al. in 2011, provides an intuitive
solution to learning rate adjustment.
▪ In traditional Stochastic Gradient Descent (SGD), the same learning rate is applied to all parameters, which
may not be ideal. Some parameters may need to be updated quickly, while others require more delicate,
slower updates.
▪ This is where AdaGrad steps in. It adapts the learning rate to the parameters, performing smaller updates
for parameters associated with frequently occurring features, and larger updates for parameters associated
with infrequent features.
▪ AdaGrad deals with the aforementioned problem by independently adapting the learning rate for each
weight component.
▪ If gradients corresponding to a certain weight vector component are large, then the respective learning rate
will be small.
▪ Inversely, for smaller gradients, the learning rate will be bigger. This way, AdaGrad deals with vanishing
and exploding gradient problems.
▪ AdaGrad accumulates the element-wise squares dw² of the gradients from all previous iterations:

v_t = v_{t−1} + dw_t²

▪ During the weight update, instead of using the normal learning rate α, AdaGrad scales it by dividing α
by the square root of the accumulated gradients √v_t:

w_t = w_{t−1} − (α / (√v_t + ε)) ∗ dw_t

▪ The small positive term ε added to the denominator prevents potential division by zero.

Algorithm 8.4 shows the AdaGrad algorithm.
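A minimal sketch of these updates, with the same illustrative grad callable and NumPy-array parameters as before:

import numpy as np

def adagrad(params, data, grad, lr=0.01, eps=1e-8, epochs=10):
    # Per-parameter learning rates from the accumulated squared gradients.
    v = np.zeros_like(params)               # running sum of squared gradients
    for _ in range(epochs):
        for x, y in data:
            g = grad(params, x, y)
            v = v + g * g                   # accumulate element-wise squares
            params = params - lr * g / (np.sqrt(v) + eps)
    return params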


▪ The greatest advantage of AdaGrad is that there is no longer a need to manually adjust the learning rate as
it adapts itself during training.
▪ Nevertheless, there is a negative side of AdaGrad: the learning rate constantly decays as iterations
increase (it is always divided by a growing positive cumulative number). Therefore, the algorithm tends
to converge slowly during the last iterations, where the learning rate becomes very low.

RMSProp (Root Mean Square Propagation):

▪ The RMSProp algorithm (Hinton, 2012) modifies AdaGrad to perform better in the non-convex setting by
changing the gradient accumulation into an exponentially weighted moving average. AdaGrad is designed
to converge rapidly when applied to a convex function.
▪ When applied to a non-convex function to train a neural network, the learning trajectory may pass through
many different structures and eventually arrive at a region that is a locally convex bowl. AdaGrad shrinks
the learning rate according to the entire history of the squared gradient and may have made the learning
rate too small before arriving at such a convex structure.
▪ RMSProp uses an exponentially decaying average to discard history from the extreme past so that it can
converge rapidly after finding a convex bowl, as if it were an instance of the AdaGrad algorithm initialized
within that bowl.

Algorithm 8.5 shows the RMSProp algorithm.
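A minimal sketch of RMSProp under the same illustrative assumptions:

import numpy as np

def rmsprop(params, data, grad, lr=0.001, rho=0.9, eps=1e-8, epochs=10):
    # Exponentially weighted moving average of squared gradients, so old
    # history is gradually discarded (unlike AdaGrad's full sum).
    v = np.zeros_like(params)
    for _ in range(epochs):
        for x, y in data:
            g = grad(params, x, y)
            v = rho * v + (1.0 - rho) * g * g   # decaying average of g²
            params = params - lr * g / (np.sqrt(v) + eps)
    return params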

▪ Compared to AdaGrad, the use of the moving average introduces a new hyperparameter, ρ, that controls
the length scale of the moving average.
▪ Empirically, RMSProp has been shown to be an effective and practical optimization algorithm for deep
neural networks. It is currently one of the go-to optimization methods being employed routinely by deep
learning practitioners.


Adam Algorithm:

▪ Adam (Kingma and Ba, 2014) is yet another adaptive learning rate optimization algorithm and is presented
in algorithm 8.7.

Algorithm 8.7 shows the Adam algorithm.
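A minimal sketch of Adam's update under the same illustrative assumptions (the defaults below follow the commonly used β1 = 0.9, β2 = 0.999):

import numpy as np

def adam(params, data, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, epochs=10):
    m = np.zeros_like(params)               # first moment (momentum term)
    v = np.zeros_like(params)               # second moment (uncentered)
    t = 0
    for _ in range(epochs):
        for x, y in data:
            t += 1
            g = grad(params, x, y)
            m = beta1 * m + (1.0 - beta1) * g
            v = beta2 * v + (1.0 - beta2) * g * g
            m_hat = m / (1.0 - beta1 ** t)  # bias correction for zero initialization
            v_hat = v / (1.0 - beta2 ** t)
            params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params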

▪ The name “Adam” derives from the phrase “adaptive moments.” In the context of the earlier algorithms,
it is perhaps best seen as a variant on the combination of RMSProp and momentum with a few important
distinctions.
▪ First, in Adam, momentum is incorporated directly as an estimate of the first order moment (with
exponential weighting) of the gradient.
▪ The most straightforward way to add momentum to RMSProp is to apply momentum to the rescaled
gradients.
▪ The use of momentum in combination with rescaling does not have a clear theoretical motivation.
▪ Second, Adam includes bias corrections to the estimates of both the first-order moments (the momentum
term) and the (uncentered) second-order moments to account for their initialization at the origin.
▪ RMSProp also incorporates an estimate of the (uncentered) second-order moment; however, it lacks the
correction factor. Thus, unlike in Adam, the RMSProp second-order moment estimate may have high bias
early in training.
▪ Adam is generally regarded as being fairly robust to the choice of hyperparameters, though the learning
rate sometimes needs to be changed from the suggested default.


Choosing the Right Optimization Algorithm:

The above topics discussed a series of related algorithms that each seek to address the challenge of optimizing
deep models by adapting the learning rate for each model parameter. At this point, a natural question is: which
algorithm should one choose? Unfortunately, there is currently no consensus on this point. Schaul et al. (2014)
presented a valuable comparison of a large number of optimization algorithms across a wide range of learning
tasks. While the results suggest that the family of algorithms with adaptive learning rates (represented by
RMSProp and AdaDelta) performed fairly robustly, no single best algorithm has emerged. Currently, the most
popular optimization algorithms actively in use include SGD, SGD with momentum, RMSProp, RMSProp
with momentum, AdaDelta and Adam. The choice of which algorithm to use, at this point, seems to depend
largely on the user’s familiarity with the algorithm (for ease of hyperparameter tuning).
