Gradient-Based Optimization
Suppose we have a function y = f(x), where both x and y are real numbers.
The derivative of this function is denoted as f′(x) or as dy/dx. The derivative f′(x) gives the
slope of f(x) at the point x.
In other words, it specifies how to scale a small change in the input in order to obtain the
corresponding change in the output: f(x + ε) ≈ f(x) + εf′(x).
The derivative is therefore useful for minimizing a function because it tells us how to
change x in order to make a small improvement in y.
For example:
We know that f(x − ε sign(f′(x))) is less than f(x) for small enough ε.
We can thus reduce f(x) by moving x in small steps with the opposite sign of the derivative.
When f′(x) = 0, the derivative provides no information about which direction to move.
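As a minimal sketch of this procedure, assuming an illustrative function f(x) = x² (not taken from the text) and its hand-coded derivative:

import numpy as np

# Illustrative function and its derivative (assumed for this sketch;
# the text does not fix a particular f).
f = lambda x: x ** 2
f_prime = lambda x: 2.0 * x

x = 3.0        # starting point
eps = 0.1      # size of each small step

for _ in range(50):
    # Move x with the opposite sign of the derivative: x ← x − ε sign(f′(x)).
    x = x - eps * np.sign(f_prime(x))

print(x, f(x))  # x ends up within eps of the minimizer at x = 0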
Points where f′(x) = 0 are known as critical points, or stationary points.
A local minimum is a point where f(x) is lower than at all neighboring points, so it is no
longer possible to decrease f(x) by making infinitesimal steps. A local maximum is a point
where f(x) is higher than at all neighboring points, so it is not possible to increase f(x) by
making infinitesimal steps. Some critical points are neither maxima nor minima; these are
known as saddle points.
The directional derivative in direction u (a unit vector) is the slope of the function f in
direction u.
In other words, the directional derivative is the derivative of the function f(x + αu) with
respect to α, evaluated at α = 0.
Using the chain rule, we can see that ∂/∂α f(x + αu) evaluates to uᵀ∇ₓf(x) when α = 0.
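A quick numerical check of this identity, using an arbitrary smooth function and a finite-difference approximation of d/dα f(x + αu); the specific function, point, and direction below are assumptions for illustration:

import numpy as np

# Arbitrary smooth function of a vector input (illustrative choice).
def f(x):
    return x[0] ** 2 + 3.0 * x[0] * x[1] + x[1] ** 2

def grad_f(x):
    # Analytic gradient ∇f(x) of the function above.
    return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0] + 2.0 * x[1]])

x = np.array([1.0, -2.0])
u = np.array([3.0, 4.0]) / 5.0   # unit vector defining the direction
alpha = 1e-6

# d/dα f(x + αu) at α = 0, approximated by a finite difference.
numeric = (f(x + alpha * u) - f(x)) / alpha
analytic = u @ grad_f(x)          # uᵀ∇ₓf(x)
print(numeric, analytic)          # the two values agree closely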
To minimize f, we would like to find the direction in which f decreases the fastest. We can
do this using the directional derivative:
min_{u, uᵀu = 1} uᵀ∇ₓf(x) = min_{u, uᵀu = 1} ||u||₂ ||∇ₓf(x)||₂ cos θ,
where θ is the angle between u and the gradient. Substituting in ||u||₂ = 1 and ignoring
factors that do not depend on u, this simplifies to min_u cos θ. This is minimized when u
points in the opposite direction as the gradient. In other words, the gradient points directly
uphill, and the negative gradient points directly downhill. We can decrease f by moving in
the direction of the negative gradient. This is known as the method of steepest descent, or
gradient descent. Steepest descent proposes a new point
x′ = x − ε∇ₓf(x),
where ε is the learning rate, a positive scalar determining the size of the step. We can
choose ε in several different ways. A popular approach is to set ε to a small constant.
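A minimal sketch of gradient descent with a constant learning rate; the quadratic cost and the values of A, b, and ε below are assumptions chosen for illustration:

import numpy as np

# Illustrative quadratic cost f(x) = 1/2 xᵀAx − bᵀx.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -1.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x

def grad_f(x):
    return A @ x - b

x = np.zeros(2)
eps = 0.1                        # learning rate: a small positive constant

for _ in range(200):
    x = x - eps * grad_f(x)      # x′ = x − ε∇ₓf(x)

print(x, np.linalg.solve(A, b))  # the iterate approaches the minimizer A⁻¹b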
Beyond the Gradient: Jacobian and Hessian Matrices
Sometimes we need to find all of the partial derivatives of a function whose input and
output are both vectors. The matrix containing all such partial derivatives is known as a
Jacobian matrix. Specifically, for a function f mapping ℝᵐ to ℝⁿ, the Jacobian matrix J of f
is the n × m matrix defined such that J_{i,j} = ∂f(x)_i / ∂x_j.
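A sketch of how such a matrix can be approximated numerically; the finite-difference helper jacobian and the example function g are assumptions for illustration:

import numpy as np

def jacobian(f, x, h=1e-6):
    # Finite-difference approximation of the Jacobian of a vector-valued f at x.
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        x_step = x.copy()
        x_step[j] += h
        J[:, j] = (f(x_step) - y) / h   # column j holds ∂f(x)_i / ∂x_j
    return J

# Illustrative function from ℝ² to ℝ².
g = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
print(jacobian(g, np.array([1.0, 2.0])))
# Exact Jacobian at (1, 2): [[2, 1], [cos(1), 0]]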
We are also sometimes interested in a derivative of a derivative: the second derivative f″(x),
which tells us how the first derivative will change as we vary the input. We can think of the
second derivative as measuring curvature, which determines how much a gradient descent
step actually improves the cost (see the numerical sketch after this list).
o If the second derivative is zero, there is no curvature. It is a perfectly flat line, and its value
can be predicted using only the gradient.
o If the gradient is 1, then we can make a step of size ε along the negative gradient, and the
cost function will decrease by ε.
o If the second derivative is negative, the function curves downward, so the cost function
will actually decrease by more than ε.
o Finally, if the second derivative is positive, the function curves upward, so the cost
function can decrease by less than ε.
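A small numerical illustration of these three cases, using hand-picked one-dimensional functions that all have derivative 1 at x = 0 (the functions themselves are assumptions for illustration):

eps = 0.1   # step size along the negative gradient

# Three functions with f′(0) = 1 but different second derivatives at 0.
flat    = lambda x: x             # f″(0) = 0  (no curvature)
concave = lambda x: x - x ** 2    # f″(0) = −2 (curves downward)
convex  = lambda x: x + x ** 2    # f″(0) = +2 (curves upward)

for name, f in [("zero curvature", flat),
                ("negative curvature", concave),
                ("positive curvature", convex)]:
    decrease = f(0.0) - f(-eps)   # decrease after stepping by eps against the gradient
    print(name, decrease)
# zero curvature:     decrease equals eps exactly (0.10)
# negative curvature: decrease greater than eps   (0.11)
# positive curvature: decrease less than eps      (0.09)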
When our function has multiple input dimensions, there are many second derivatives. These
derivatives can be collected together into a matrix called the Hessian matrix. The Hessian matrix
H(f)(x) is defined such that H(f)(x)_{i,j} = ∂²f(x) / (∂x_i ∂x_j).
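A sketch of this definition as a finite-difference computation; the helper hessian and the example function are illustrative assumptions:

import numpy as np

def hessian(f, x, h=1e-4):
    # Finite-difference approximation of the Hessian of a scalar-valued f at x.
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.zeros(n), np.zeros(n)
            e_i[i], e_j[j] = h, h
            # H[i, j] ≈ ∂²f(x) / (∂x_i ∂x_j)
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i) - f(x + e_j) + f(x)) / h ** 2
    return H

# Illustrative function f(x) = x₀² x₁ + x₁³.
f = lambda x: x[0] ** 2 * x[1] + x[1] ** 3
print(hessian(f, np.array([1.0, 2.0])))
# Exact Hessian at (1, 2): [[4, 2], [2, 12]] (note that it is symmetric)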
Most of the functions we encounter in the context of deep learning have a symmetric
Hessian almost everywhere. Because the Hessian matrix is real and symmetric, we can
decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors.
The second derivative in a specific direction represented by a unit vector d is given by dᵀHd.
When d is an eigenvector of H, the second derivative in that direction is given by the
corresponding eigenvalue.
For other directions of d, the directional second derivative is a weighted average of all of
the eigenvalues, with weights between 0 and 1, and eigenvectors that have a smaller angle
with d receiving more weight.
The maximum eigenvalue determines the maximum second derivative and the minimum
eigenvalue determines the minimum second derivative.
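A numerical sketch of these facts for one arbitrary symmetric Hessian (the matrix H below is an assumption for illustration):

import numpy as np

# A symmetric matrix standing in for a Hessian (illustrative choice).
H = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# Real eigenvalues and an orthogonal basis of eigenvectors.
eigenvalues, eigenvectors = np.linalg.eigh(H)

# Directional second derivatives dᵀHd for many random unit vectors d.
rng = np.random.default_rng(0)
d = rng.normal(size=(1000, 2))
d /= np.linalg.norm(d, axis=1, keepdims=True)
second_derivatives = np.einsum('ni,ij,nj->n', d, H, d)

# Every dᵀHd lies between the smallest and largest eigenvalue.
print(eigenvalues.min(), second_derivatives.min())
print(eigenvalues.max(), second_derivatives.max())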
The (directional) second derivative tells us how well we can expect a gradient descent step
to perform.
We can make a second-order Taylor series approximation to the function f(x) around the
current point x⁽⁰⁾:
f(x) ≈ f(x⁽⁰⁾) + (x − x⁽⁰⁾)ᵀ g + ½ (x − x⁽⁰⁾)ᵀ H (x − x⁽⁰⁾),
where g is the gradient and H is the Hessian at x⁽⁰⁾. If we use a learning rate ε, then the new
point x will be given by x⁽⁰⁾ − εg. Substituting this into the approximation, we obtain
f(x⁽⁰⁾ − εg) ≈ f(x⁽⁰⁾) − ε gᵀg + ½ ε² gᵀHg.
When gᵀHg is zero or negative, the Taylor series approximation predicts that increasing ε
forever will decrease f forever.
When gᵀHg is positive, solving for the optimal step size that decreases the Taylor series
approximation of the function the most yields
ε* = gᵀg / (gᵀHg).
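A sketch of this optimal step size on an illustrative quadratic, for which the second-order Taylor approximation is exact; H, the starting point, and the small constant ε are assumptions:

import numpy as np

# Illustrative quadratic f(x) = 1/2 xᵀHx, so the approximation above is exact.
H = np.array([[3.0, 0.0],
              [0.0, 1.0]])
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x

x0 = np.array([1.0, 2.0])
g = grad(x0)

# Optimal step size ε* = gᵀg / (gᵀHg), valid because gᵀHg > 0 here.
eps_star = (g @ g) / (g @ H @ g)

print(f(x0), f(x0 - eps_star * g))   # decrease achieved with the optimal step
print(f(x0 - 0.01 * g))              # a small constant step decreases f by less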
Figure: A saddle point containing both positive and negative curvature. The function in this
example is f(x) = x₁² − x₂². Along the axis corresponding to x₁, the function curves upward;
along the axis corresponding to x₂, it curves downward.
Gradient descent, which uses only the gradient, does not exploit this curvature information;
we can do better by using information from the Hessian matrix to guide the search. The
simplest method for doing so is known as Newton's method. Newton's method is based on
using a second-order Taylor series expansion to approximate f(x) near some point x⁽⁰⁾:
f(x) ≈ f(x⁽⁰⁾) + (x − x⁽⁰⁾)ᵀ ∇ₓf(x⁽⁰⁾) + ½ (x − x⁽⁰⁾)ᵀ H(f)(x⁽⁰⁾) (x − x⁽⁰⁾).
Solving for the critical point of this approximation gives the Newton update
x* = x⁽⁰⁾ − H(f)(x⁽⁰⁾)⁻¹ ∇ₓf(x⁽⁰⁾).
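A minimal sketch of one Newton update on an illustrative quadratic cost (A and b are assumptions; for a quadratic, the constant Hessian makes a single step land on the minimum, while a non-quadratic f would require iterating the update):

import numpy as np

# Illustrative quadratic f(x) = 1/2 xᵀAx − bᵀx, whose Hessian is the constant A.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -1.0])

grad    = lambda x: A @ x - b   # ∇ₓf(x)
hessian = lambda x: A           # H(f)(x)

x = np.zeros(2)
# Newton update: x ← x − H(f)(x)⁻¹ ∇ₓf(x), computed with a linear solve.
x = x - np.linalg.solve(hessian(x), grad(x))

print(x, np.linalg.solve(A, b))  # one step lands on the minimizer A⁻¹b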
Optimization algorithms that use only the gradient, such as gradient descent, are called
first-order optimization algorithms.
Optimization algorithms that also use the Hessian matrix, such as Newton's method, are
called second-order optimization algorithms.
In the context of deep learning, we sometimes gain some guarantees by restricting
ourselves to functions that are either Lipschitz continuous or have Lipschitz continuous
derivatives. A Lipschitz continuous function is a function f whose rate of change is bounded
by a Lipschitz constant L: for all x and y, |f(x) − f(y)| ≤ L ||x − y||₂. This property lets us
quantify the assumption that a small change in the input made by an algorithm such as
gradient descent will produce only a small change in the output.
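A rough numerical check of this bound for one Lipschitz continuous function; the choice f(x) = |x| with constant L = 1 is an illustrative assumption:

import numpy as np

f, L = np.abs, 1.0   # |x| is Lipschitz continuous with constant L = 1

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = rng.normal(size=10_000)

# |f(x) − f(y)| ≤ L |x − y| should hold for every sampled pair.
print(np.all(np.abs(f(x) - f(y)) <= L * np.abs(x - y) + 1e-12))   # True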
Convex optimization algorithms are able to provide many more guarantees by making
stronger restrictions.
Convex optimization algorithms are applicable only to convex functions -- functions for
which the Hessian is positive semidefinite everywhere.
Such functions are well-behaved because they lack saddle points and all of their local
minima are necessarily global minima.
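A small sketch of that criterion, checking positive semidefiniteness through the eigenvalues of the Hessian for one convex quadratic and for the saddle-shaped function from the figure above (the matrices are written out by hand for illustration):

import numpy as np

def is_positive_semidefinite(H, tol=1e-10):
    # A real symmetric matrix is positive semidefinite iff all eigenvalues are >= 0.
    return bool(np.all(np.linalg.eigvalsh(H) >= -tol))

# Hessian of the convex quadratic f(x) = x₁² + x₂².
H_convex = np.array([[2.0, 0.0],
                     [0.0, 2.0]])

# Hessian of f(x) = x₁² − x₂², the saddle point example from the figure.
H_saddle = np.array([[2.0, 0.0],
                     [0.0, -2.0]])

print(is_positive_semidefinite(H_convex))   # True: the function is convex
print(is_positive_semidefinite(H_saddle))   # False: the function has a saddle point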
However, most problems in deep learning are difficult to express in terms of convex
optimization. Convex optimization is used only as a subroutine of some deep learning
algorithms.
Ideas from the analysis of convex optimization algorithms can be useful for proving the
convergence of deep learning algorithms.