Gradient Descent Algorithm in Machine Learning: Dr. P. K. Chaurasia
Machine Learning
Dr. P. K. Chaurasia
Associate Professor,
Department of Computer Science and
Information Technology
MGCUB, Motihari, Bihar
Objectives
Introduction
Optimization
Gradient Descent
Types of Gradient Descent
Batch Gradient Descent
Stochastic Gradient Descent
Review Questions
References
Introduction
The objective of optimization is to solve real-life problems,
i.e., to obtain the best possible output for a given problem.
In machine learning, optimization is slightly different.
Generally, while optimizing, we know exactly what our data
looks like and which areas we want to improve.
But in machine learning we have no idea what our "new
data" will look like, let alone how to optimize on it.
Therefore, in machine learning, we perform optimization
on the training data and check its performance on new
validation data.
Optimization Techniques
Optimization techniques appear in many fields, for example:
Mechanics: deciding the surface of an aerospace design.
Economics: cost optimization.
Physics: time optimization in quantum computing.
• Many popular machine learning algorithms depend on
optimization techniques, e.g., linear regression, neural
networks, k-nearest neighbours, etc.
• Gradient descent is the most commonly used optimization
technique in machine learning.
Gradient Descent
Let θ = θ_{t+1} = θ_t + h. To first order, f(θ_{t+1}) ≈ f(θ_t) + h ∇f(θ_t).
The derivative of f(θ_{t+1}) with respect to h is ∇f(θ_t).
At h = ∇f(θ_t) a maximum occurs (since (∇f(θ_t))² is positive),
and at h = −∇f(θ_t) a minimum occurs.
Alternatively: the slope ∇f(θ_t) points in the direction of steepest ascent.
If we take a step η in the opposite direction, we decrease the value of f.
Gradient descent algorithm (δ is the convergence threshold):
t ← 1
do
    θ_{t+1} ← θ_t − η ∇f(θ_t)
    t ← t + 1
while ||θ_t − θ_{t−1}|| > δ
return θ_t
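As an illustration, here is a minimal Python sketch of the loop above; the function name gradient_descent, the use of NumPy, and the default values of η, δ, and the iteration cap are illustrative choices, not part of the slides.

import numpy as np

def gradient_descent(grad_f, theta0, eta=0.1, delta=1e-6, max_iter=10000):
    # grad_f: function returning the gradient of f at theta
    # eta: step size; delta: convergence threshold on ||theta_t - theta_{t-1}||
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        theta_next = theta - eta * grad_f(theta)         # theta_{t+1} = theta_t - eta * grad f(theta_t)
        if np.linalg.norm(theta_next - theta) <= delta:  # stop when the update is small enough
            return theta_next
        theta = theta_next
    return theta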
One-dimensional example
Let f(θ) = θ².
This function has its minimum at θ = 0, which we want
to determine using gradient descent.
We have f′(θ) = 2θ.
The gradient descent update moves along −f′(θ): θ_{t+1} = θ_t − η f′(θ_t).
If θ_t > 0, then f′(θ_t) = 2θ_t is positive, thus θ_{t+1} < θ_t.
If θ_t < 0, then f′(θ_t) = 2θ_t is negative, thus θ_{t+1} > θ_t.
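To make this concrete, a minimal sketch of a few iterations on f(θ) = θ², starting from θ = 1 with step size η = 0.1 (both values are illustrative):

theta = 1.0                              # starting point theta_1
eta = 0.1                                # step size
for t in range(5):
    theta = theta - eta * 2 * theta      # f'(theta) = 2 * theta
    print(theta)
# the iterates are 0.8, 0.64, 0.512, 0.4096, 0.32768:
# each step multiplies theta by (1 - 2*eta) = 0.8, so they shrink toward the minimum at 0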
Ex: Gradient Descent on Least Squares
• Criterion to minimize: f(x) = (1/2) ||Ax − b||²
• The gradient is ∇f(x) = Aᵀ(Ax − b)
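A minimal NumPy sketch of gradient descent on this least-squares criterion; the random A and b, the step size, and the iteration count are illustrative assumptions, and np.linalg.lstsq is used only to check the answer:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))             # example data, not from the slides
b = rng.normal(size=20)

x = np.zeros(3)
eta = 0.01                               # step size
for _ in range(5000):
    x = x - eta * A.T @ (A @ x - b)      # gradient step: grad f(x) = A^T (A x - b)

print(x)                                      # gradient descent solution
print(np.linalg.lstsq(A, b, rcond=None)[0])   # closed-form least-squares solution for comparison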
Linear Regression
Logistic Regression
Stochastic Gradient Descent
Gradient descent can be slow to run on very large datasets.
One iteration of the gradient descent algorithm requires a prediction for
each instance in the training dataset, so it can take a long time when you
have many millions of instances.
When you have large amounts of data, you can use a variation of gradient
descent called stochastic gradient descent.
In each iteration, a few samples are selected randomly instead of the whole
dataset. In gradient descent, the term "batch" denotes the total number of
samples from the dataset used to calculate the gradient in each iteration.
Stochastic Gradient Descent
Initialize w_1
for k = 1 to K do
    Pick a training example i at random
    Update w_{k+1} ← w_k − α ∇f_i(w_k)
end for
Return w_{K+1}
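A minimal Python sketch of this loop for least-squares linear regression, where f_i(w) = (1/2)(x_iᵀw − y_i)² so that ∇f_i(w) = (x_iᵀw − y_i) x_i; the synthetic data, step size α, and number of iterations K are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))           # synthetic inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(3)                          # initialize w_1
alpha = 0.01                             # step size
K = 20000
for k in range(K):
    i = rng.integers(len(X))             # pick one training example at random
    grad_i = (X[i] @ w - y[i]) * X[i]    # gradient of f_i at w_k
    w = w - alpha * grad_i               # w_{k+1} = w_k - alpha * grad f_i(w_k)

print(w)                                 # should be close to true_w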
Review Questions
List of Books
Understanding Machine Learning: From Theory to Algorithms.
Introductory Machine Learning notes
Foundations of Machine Learning
List of References
Bottou, Léon (1998). "Online Algorithms and Stochastic Approximations."
Online Learning and Neural Networks. Cambridge University Press.
ISBN 978-0-521-65263-6.
Bottou, Léon (2010). "Large-Scale Machine Learning with Stochastic
Gradient Descent." Proceedings of COMPSTAT'2010. Physica-Verlag HD.
177-186.
Bottou, Léon (2012). "Stochastic Gradient Descent Tricks." Neural
Networks: Tricks of the Trade. Springer Berlin Heidelberg. 421-436.