Gradient Descent Algorithms and Variations - PyImageSearch
In this tutorial, you will learn the fundamentals of gradient descent, how Stochastic Gradient Descent (SGD) and mini-batch SGD improve on the vanilla algorithm, and how momentum and Nesterov acceleration can further improve SGD.
To learn about gradient descent and its variations, just keep reading.
There have been a tremendous number of variations of gradient descent and optimizers, including vanilla gradient descent, mini-batch gradient descent, Stochastic Gradient Descent (SGD), and mini-batch SGD, just to name a few.
Furthermore, entirely new model optimizers have been designed with improvements to SGD in
mind, including Adam, Adadelta, RMSprop, and others.
Today we are going to review the fundamentals of gradient descent and focus primarily on SGD,
including two improvements to SGD, momentum and Nesterov acceleration.
1 We start by taking our cost/loss function (i.e., the function responsible for computing the value we want to minimize)
2 We then compute the gradient of that loss with respect to the parameters we are optimizing
3 And finally, we take a step in the direction opposite of the gradient (since this will take us down the path to our local minimum), as the sketch below illustrates
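To make these three steps concrete, here is a minimal, hypothetical sketch in plain NumPy that minimizes a simple one-dimensional parabola with gradient descent (the loss function, learning rate, and iteration count are all arbitrary choices for illustration, not values from this tutorial):

```python
import numpy as np

# Step 1: the cost/loss function we want to minimize (a simple parabola)
def loss(w):
    return (w - 3.0) ** 2

# Step 2: the gradient of that loss with respect to the parameter w
def gradient(w):
    return 2.0 * (w - 3.0)

w = np.float64(-4.0)  # arbitrary starting value for the parameter
lr = 0.1              # learning rate (how large a step we take)

for i in range(50):
    # Step 3: take a step in the direction *opposite* the gradient
    w = w - lr * gradient(w)

print(w, loss(w))     # w converges toward 3.0, the minimum of the parabola
```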
Figure 1: The goal of gradient descent is to iteratively take steps towards lower areas of the loss landscape, similar
to descending to the bottom of a parabola, but in multiple dimensions (image source
(https://medium.com/@divakar_239/stochastic-vs-batch-gradient-descent-8820568eada1)).
But how does this apply to neural networks and deep learning?
Let’s address that in the next section.
A neural network consists of one or more hidden layers. Each layer consists of a set of
parameters. Our goal is to optimize these parameters such that our loss is minimized.
Typical loss functions include binary cross-entropy (two-class classification), categorical cross-
entropy (multi-class classification), mean squared error (regression), etc.
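As a quick, hypothetical example of how this looks in practice, Keras/TensorFlow lets you pick one of these losses by name when compiling a model (the tiny architecture below is made up purely for illustration):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.optimizers import SGD

# A made-up two-class classifier: 4 input features, one hidden layer,
# and a single sigmoid output
model = Sequential([
    Input(shape=(4,)),
    Dense(8, activation="relu"),
    Dense(1, activation="sigmoid"),
])

# Two-class classification => binary cross-entropy; swap in
# "categorical_crossentropy" or "mean_squared_error" for other tasks
model.compile(loss="binary_crossentropy",
              optimizer=SGD(learning_rate=0.01),
              metrics=["accuracy"])
```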
There are many types of loss functions, each of which is used in certain roles. Instead of getting too caught up in which loss function is being used, think of it this way:
1 We initialize the weights of our neural network (typically with small random values)
2 We ask the neural network to make a prediction on a data point from our training set
3 We take that prediction and compute the loss/cost function, which tells us how good/bad of a job we did at making the correct prediction
4 We compute the gradient of the loss
5 And then we ever-so-slightly tweak the parameters of the neural network such that our predictions are better
We do this over and over again until our model is said to “converge” and is able to make reliable,
accurate predictions.
There are many types of gradient descent algorithms, but the types we'll be focusing on here today are:
1 Vanilla gradient descent
2 Stochastic Gradient Descent (SGD)
3 Mini-batch SGD
The most basic form of gradient descent, which I like to call vanilla gradient descent, updates the weights of the network only once per pass over the training data, meaning that the network sees the entire dataset before a single weight update is performed.
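A rough sketch of this behavior, using a made-up linear-regression problem and a hand-computed mean-squared-error gradient (none of this data or code comes from the tutorial itself), might look like the following; note that the weights change only once per full pass:

```python
import numpy as np

# Made-up toy data: N data points, 3 features, linear target plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=10_000)

W = np.zeros(3)   # parameters to learn
lr = 0.05         # learning rate

for epoch in range(100):
    preds = X @ W                              # predictions on the *entire* dataset
    grad = 2.0 * X.T @ (preds - y) / len(y)    # gradient of the mean squared error
    W -= lr * grad                             # exactly one weight update per epoch
```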
If the number of training examples is large, then vanilla gradient descent is going to take a long time to converge, because a weight update happens only once per pass over the entire dataset.
Furthermore, the larger your dataset gets, the more nuanced your gradients can become, and if
you’re only updating the weights once per epoch then you’re going to be spending the
majority of your time computing predictions and not much time actually learning (which is the
goal of an optimization problem, right?)
Luckily, there are other variations of gradient descent that address this problem.
The original formulation of SGD would do N weight updates per epoch where N is equal to the
total number of data points in your dataset. So, using our example above, if we have N=10,000
images, then we would have 10,000 weight updates per epoch.
Until convergence:
1 Randomly sample a data point from our training set
2 Make a prediction on it
3 Compute the loss and the gradient of the loss
4 Update the network's weights in the direction opposite of the gradient
SGD tends to converge much faster because it’s able to start improving itself after each and
every weight update.
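Continuing the same hypothetical linear-regression setup from the earlier sketch, a per-sample SGD loop makes one weight update for every single data point it sees:

```python
import numpy as np

# Same made-up toy data as in the vanilla gradient descent sketch above
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=10_000)

W = np.zeros(3)
lr = 0.01

for epoch in range(10):
    # Visit the training points in a random order each epoch
    for i in rng.permutation(len(y)):
        pred = X[i] @ W                     # prediction on a single data point
        grad = 2.0 * (pred - y[i]) * X[i]   # gradient of that point's squared error
        W -= lr * grad                      # one weight update per data point (N per epoch)
```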
That said, performing N weight updates per epoch (where N is equal to the total number of data
points in our dataset) is also a bit computationally wasteful — we’ve now swung to the other side
of the pendulum.
Mini-batch SGD
Figure 3: Top: Vanilla gradient descent. Bottom: An illustration of mini-batch SGD with a batch size of S=3. At each iteration, a batch of S data points is sampled, predictions are made, loss is computed, and parameters to the network are updated (image source (https://kenndanielso.github.io/mlrefined/blog_posts/13_Multilayer_perceptrons/13_6_Stochastic_and_minibatch_gra
While SGD can converge faster for large datasets, we actually run into another problem — we
cannot leverage our vectorized libraries that make training super fast (again, because we are only
passing one data point at a time through the network).
There is a variation of SGD called mini-batch SGD that solves this problem. When you hear
people talking about SGD what they are almost always referring to is mini-batch SGD.
Mini-batch SGD introduces the concept of a batch size, S. Now, given a dataset of size N, there will be a total of N / S weight updates to the network per epoch.
Until convergence:
1 Randomly sample a batch of S data points from our training set
2 Make predictions on the batch
3 Compute the loss over the batch and the gradient of that loss
4 Update the network's weights in the direction opposite of the gradient
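Here is the same hypothetical example reworked as mini-batch SGD; the only change from the per-sample version is that we slice the shuffled indices into batches of size S and compute a vectorized gradient over each batch:

```python
import numpy as np

# Same made-up toy data as in the earlier sketches
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=10_000)

W = np.zeros(3)
lr = 0.05
S = 32   # batch size

for epoch in range(10):
    order = rng.permutation(len(y))
    for start in range(0, len(y), S):
        idx = order[start:start + S]                     # next batch of (up to) S points
        preds = X[idx] @ W                               # vectorized predictions on the batch
        grad = 2.0 * X[idx].T @ (preds - y[idx]) / len(idx)
        W -= lr * grad                                   # roughly N / S weight updates per epoch
```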
If you plot the loss of each mini-batch directly, then you'll see a very noisy plot, such as the following
one:
Figure 4: Plotting the loss of every mini-batch can lead to a very noisy plot (image source
(https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3)).
But when you average out the loss across all mini-batches the plot is actually quite stable:
Figure 5: Averaging the mini-batch loss over the course of an entire epoch leads to more stable-
looking plots (image source (https://towardsdatascience.com/gradient-descent-algorithm-and-
its-variants-10f652806a3)).
Note: Depending on what deep learning library you are using you may see both types of plots.
When you hear deep learning practitioners talking about SGD they are more than likely talking
about mini-batch SGD.
SGD has a problem when navigating areas of the loss landscape that are significantly steeper in
one dimension than in others (which you’ll see around local optima).
When this happens, it appears that SGD simply oscillates across the slopes of the ravine instead of descending into areas of lower loss and, ideally, higher accuracy (see Sebastian Ruder's excellent article
(https://ruder.io/optimizing-gradient-descent/) for more details on this phenomenon).
By applying momentum (Figure 6) we build up a head of steam in a direction and then allow
gravity to roll us faster and faster down the hill.
Get used to seeing momentum when using SGD — it is used in the majority of neural network
experiments that apply SGD.
The problem with momentum is that once you develop a head of steam, the train can easily get out of control and roll right over our local minimum and back up the hill again.
Nesterov acceleration accounts for this and helps us recognize when the loss landscape starts
sloping back up again.
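To make the two update rules concrete, here is a rough NumPy sketch; the quadratic stand-in loss, the 0.9 momentum coefficient, and the 0.01 learning rate are all arbitrary illustrative choices:

```python
import numpy as np

def grad_fn(w):
    # Stand-in gradient: pretend the loss is ||w||^2, so its gradient is 2w
    return 2.0 * w

lr, mu = 0.01, 0.9                       # learning rate and momentum coefficient

# Classic momentum: accumulate a velocity from past gradients, then step
w, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    v = mu * v - lr * grad_fn(w)
    w = w + v

# Nesterov acceleration: evaluate the gradient at the "look-ahead" position
# w + mu * v, which lets the update correct course before overshooting
w, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    v = mu * v - lr * grad_fn(w + mu * v)
    w = w + v
```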
Nearly all deep learning libraries that contain an SGD implementation also include momentum and Nesterov acceleration terms.
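In Keras/TensorFlow, for example, both terms are simply arguments on the SGD optimizer (the values below are illustrative starting points you would tune, not recommendations from this tutorial):

```python
from tensorflow.keras.optimizers import SGD

# SGD with a momentum term and Nesterov acceleration enabled
opt = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)

# The optimizer is then passed to model.compile(...) as usual, e.g.:
# model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])
```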
Momentum is nearly always a good idea. Nesterov acceleration works in some situations and not
in others. You’ll want to treat them as hyperparameters you need to tune when training your
neural networks (i.e., pick values for each, run an experiment, log the results, update the
parameters, and repeat until you find a set of hyperparameters that yields good results).
Furthermore, we have an entire set of tutorials on hyperparameter optimization, which you can find here (https://pyimagesearch.com/2021/05/17/introduction-to-hyperparameter-tuning-with-scikit-learn-and-python/).
Summary
In this tutorial, you learned about gradient descent and its variations, most notably Stochastic Gradient Descent (SGD).
SGD is the workhorse of deep learning. All optimizers, including Adam, Adadelta, RMSprop, etc.,
have their roots in SGD — each of these optimizers provides tweaks and variations to SGD,
ideally improving convergence and making the model more stable during training.
We’ll cover these more advanced optimizers soon, but for the time being, understand that SGD is
the basis of all of them.
We can further improve SGD by including a momentum term (nearly always recommended).
Occasionally, Nesterov acceleration can further improve SGD (dependent on your specific
project).
To download the source code to this post (and be notified when future tutorials are published
here on PyImageSearch), simply enter your email address in the form below!