Gradient Descent Algorithm in Machine Learning: Dr. P. K. Chaurasia
Machine Learning
Dr. P. K. Chaurasia
Associate Professor,
Department of Computer Science and
Information Technology
MGCUB, Motihari, Bihar
Objectives
Introduction
Optimization
Gradient Descent
Types of Gradient Descent
Batch Gradient Descent
Stochastic Gradient Descent
Review Questions
References
Introduction
The objective of optimization is to solve real-life problems,
i.e., to obtain the best possible output for a given problem.
In machine learning, optimization is slightly different.
Generally, while optimizing, we know exactly what our data
looks like and which areas we want to improve.
But in machine learning we have no idea what our "new
data" will look like, let alone how to optimize on it.
Therefore, in machine learning, we perform optimization
on the training data and check its performance on new
validation data.
Optimization Techniques
Optimization techniques appear in many fields, for example:
Mechanics: deciding the surface of an aerospace design.
Economics: cost optimization.
Physics: time optimization in quantum computing.
• Many popular machine learning algorithms depend on
optimization techniques, e.g., linear regression, neural
networks, k-nearest neighbours, etc.
• Gradient descent is the most commonly used optimization
technique in machine learning.
Gradient Descent
Let θ = θ_{t+1} = θ_t + h. To first order, f(θ_{t+1}) ≈ f(θ_t) + h ∇f(θ_t).
The derivative of f(θ_{t+1}) with respect to h is ∇f(θ_t).
At h = ∇f(θ_t) a maximum occurs (since (∇f(θ_t))² is positive),
and at h = −∇f(θ_t) a minimum occurs.
Alternatively: the slope ∇f(θ_t) points in the direction of steepest ascent.
If we take a step η in the opposite direction, we decrease the value of f.
Gradient descent algorithm (δ is the convergence threshold):
t ← 1
do
    θ_{t+1} ← θ_t − η ∇f(θ_t)
    t ← t + 1
while ||θ_t − θ_{t−1}|| > δ
return θ_t
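As an illustration, here is a minimal Python sketch of the loop above; the function name gradient_descent, the use of NumPy, and the default values of η, δ, and the iteration cap are illustrative choices, not part of the slides.

import numpy as np

def gradient_descent(grad_f, theta0, eta=0.1, delta=1e-6, max_iter=10000):
    # grad_f: function returning the gradient of f at theta
    # eta: step size; delta: convergence threshold on ||theta_t - theta_{t-1}||
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        theta_next = theta - eta * grad_f(theta)         # theta_{t+1} = theta_t - eta * grad f(theta_t)
        if np.linalg.norm(theta_next - theta) <= delta:  # stop when the update is small enough
            return theta_next
        theta = theta_next
    return theta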
One-dimensional example
Let f(θ) = θ².
This function has its minimum at θ = 0, which we want
to determine using gradient descent.
We have f′(θ) = 2θ.
The gradient descent update moves along −f′(θ): θ_{t+1} = θ_t − η f′(θ_t).
If θ_t > 0, then f′(θ_t) = 2θ_t is positive, thus θ_{t+1} < θ_t.
If θ_t < 0, then f′(θ_t) = 2θ_t is negative, thus θ_{t+1} > θ_t.
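To make this concrete, a minimal sketch of a few iterations on f(θ) = θ², starting from θ = 1 with step size η = 0.1 (both values are illustrative):

theta = 1.0                              # starting point theta_1
eta = 0.1                                # step size
for t in range(5):
    theta = theta - eta * 2 * theta      # f'(theta) = 2 * theta
    print(theta)
# the iterates are 0.8, 0.64, 0.512, 0.4096, 0.32768:
# each step multiplies theta by (1 - 2*eta) = 0.8, so they shrink toward the minimum at 0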
Ex: Gradient Descent on Least Squares
• Criterion to minimize: f(x) = (1/2) ||Ax − b||²
• The gradient is ∇f(x) = Aᵀ(Ax − b)
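A minimal NumPy sketch of gradient descent on this least-squares criterion; the random A and b, the step size, and the iteration count are illustrative assumptions, and np.linalg.lstsq is used only to check the answer:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))             # example data, not from the slides
b = rng.normal(size=20)

x = np.zeros(3)
eta = 0.01                               # step size
for _ in range(5000):
    x = x - eta * A.T @ (A @ x - b)      # gradient step: grad f(x) = A^T (A x - b)

print(x)                                      # gradient descent solution
print(np.linalg.lstsq(A, b, rcond=None)[0])   # closed-form least-squares solution for comparison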
Linear Regression
Logistic Regression
Stochastic Gradient Descent
Gradient descent can be slow to run on very large datasets.
One iteration of the gradient descent algorithm requires a prediction for
each instance in the training dataset, so it can take a long time when you
have many millions of instances.
When you have large amounts of data, you can use a variation of gradient
descent called stochastic gradient descent.
In each iteration, a few samples are selected randomly instead of the whole
dataset. In gradient descent, the term "batch" denotes the total number of
samples from the dataset used to calculate the gradient in each iteration.
Stochastic Gradient Descent
Initialize w_1
for k = 1 to K do
    Pick a training example i at random
    Update w_{k+1} ← w_k − α ∇f_i(w_k)
end for
Return w_{K+1}
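A minimal Python sketch of this loop for least-squares linear regression, where f_i(w) = (1/2)(x_iᵀw − y_i)² so that ∇f_i(w) = (x_iᵀw − y_i) x_i; the synthetic data, step size α, and number of iterations K are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))           # synthetic inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(3)                          # initialize w_1
alpha = 0.01                             # step size
K = 20000
for k in range(K):
    i = rng.integers(len(X))             # pick one training example at random
    grad_i = (X[i] @ w - y[i]) * X[i]    # gradient of f_i at w_k
    w = w - alpha * grad_i               # w_{k+1} = w_k - alpha * grad f_i(w_k)

print(w)                                 # should be close to true_w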
Review Questions
List of Books
Understanding Machine Learning: From Theory to Algorithms.
Introductory Machine Learning notes
Foundations of Machine Learning
List of References
Bottou, Léon (1998). "Online Algorithms and Stochastic Approximations."
Online Learning and Neural Networks. Cambridge University Press.
ISBN 978-0-521-65263-6.
Bottou, Léon (2010). "Large-Scale Machine Learning with Stochastic
Gradient Descent." Proceedings of COMPSTAT'2010. Physica-Verlag HD.
177-186.
Bottou, Léon (2012). "Stochastic Gradient Descent Tricks." Neural
Networks: Tricks of the Trade. Springer Berlin Heidelberg. 421-436.