Pure Optimization
In machine learning, the loss function and the optimizer are two components that work together to improve a model's performance.
A loss function evaluates how well the model performs by measuring the difference between the model's predicted outputs and the expected outputs.
Examples of loss functions include log loss, hinge loss, and mean squared error.
The optimizer improves the model by adjusting its parameters so as to reduce the value of the loss function.
SGD, RMSProp, and Adam are a few examples of optimizers.
The optimizer's job is to determine which combination of the neural network's weights and biases gives it the best chance of generating accurate predictions.
Pure Optimization
1. Objective function
2. Optimization algorithms
3. Learning rate scheduling
4. Regularization techniques
5. Hyperparameter tuning
6. Advanced techniques
7. Distributed and parallel optimization
8. Neural architecture search
1. Objective Function
The objective function, often a loss function, measures how well the
model performs.
Examples include Mean Squared Error (MSE) for regression and
Cross-Entropy Loss for classification.
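A minimal NumPy sketch of the two losses mentioned above; the target and prediction values are made-up examples:

import numpy as np

# Mean Squared Error for a regression model (example values)
y_true = np.array([2.0, 1.5, 3.0])
y_pred = np.array([2.5, 1.0, 2.5])
mse = np.mean((y_true - y_pred) ** 2)

# Binary cross-entropy for a classifier (predictions clipped to avoid log(0))
t = np.array([1.0, 0.0, 1.0])
p = np.clip(np.array([0.9, 0.2, 0.7]), 1e-12, 1 - 1e-12)
cross_entropy = -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))

print(mse, cross_entropy)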
2. Optimization Algorithms
Gradient Descent: The most basic optimization method, adjusting
model parameters based on the gradient of the loss function.
Variants of Gradient Descent:
• Momentum: Accumulates gradients over time to smooth out updates.
• Adam (Adaptive Moment Estimation): Combines momentum and
adaptive learning rates, making it very popular for training deep
networks.
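As a sketch of how the Adam update combines momentum and adaptive learning rates; the hyperparameter defaults are the commonly used ones, and the quadratic toy objective is an assumption made only for this example:

import numpy as np

def adam_step(params, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update: first moment (momentum) and second moment (adaptive scaling)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

# Toy example: minimize f(x) = x^2 starting from x = 5
params = np.array([5.0])
m = v = np.zeros_like(params)
for t in range(1, 1001):
    grad = 2 * params                      # gradient of x^2
    params, m, v = adam_step(params, grad, m, v, t, lr=0.1)
print(params)                              # ends close to the minimum at 0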
3. Learning Rate Scheduling
Adjusting the learning rate during training can lead to better convergence. Techniques include:
• Step Decay: Reducing the learning rate at fixed intervals.
• Exponential Decay: Gradually decreasing the learning rate according to an exponential function.
• Cyclic Learning Rates: Varying the learning rate cyclically within a range.
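A small sketch of the three schedules; the decay factors, interval, and cycle length are illustrative choices, not prescribed values:

import numpy as np

def step_decay(lr0, epoch, drop=0.5, every=10):
    # Step decay: halve the learning rate every 10 epochs
    return lr0 * drop ** (epoch // every)

def exponential_decay(lr0, epoch, k=0.05):
    # Exponential decay: lr = lr0 * exp(-k * epoch)
    return lr0 * np.exp(-k * epoch)

def cyclic_lr(lr_min, lr_max, epoch, cycle=10):
    # Triangular schedule oscillating between lr_min and lr_max
    x = abs(epoch % cycle - cycle / 2) / (cycle / 2)
    return lr_min + (lr_max - lr_min) * (1 - x)

for epoch in range(0, 30, 5):
    print(epoch, step_decay(0.1, epoch), exponential_decay(0.1, epoch), cyclic_lr(0.001, 0.1, epoch))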
4. Regularization Techniques
To prevent overfitting, regularization methods like L1 and L2 regularization, dropout, and early stopping are employed.
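A minimal sketch of two of these methods, L2 regularization (an added penalty on the weights) and early stopping; the penalty strength, patience, and validation losses are made-up values:

import numpy as np

def l2_regularized_loss(data_loss, weights, lam=1e-4):
    # Add an L2 penalty (weight decay) to the data loss
    return data_loss + lam * np.sum(weights ** 2)

print(l2_regularized_loss(0.35, np.array([0.5, -1.2, 2.0])))

# Early stopping: stop when the validation loss has not improved for `patience` epochs
best_val, patience, wait = np.inf, 3, 0
for epoch, val_loss in enumerate([0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64]):
    if val_loss < best_val:
        best_val, wait = val_loss, 0
    else:
        wait += 1
        if wait >= patience:
            print("stopping at epoch", epoch)
            break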
5. Hyperparameter Tuning
Fine-tuning hyperparameters (e.g., learning rate, batch size, network architecture) can significantly enhance model performance. Techniques like grid search, random search, and Bayesian optimization are commonly used.
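A sketch of random search over two hyperparameters; train_and_score is a placeholder standing in for a real training run, and the search ranges are assumptions for the example:

import numpy as np

rng = np.random.default_rng(0)

def train_and_score(lr, batch_size):
    # Placeholder for training a model and returning a validation score
    return -abs(np.log10(lr) + 3) - abs(batch_size - 64) / 64

best = None
for _ in range(20):
    lr = 10 ** rng.uniform(-5, -1)                  # sample the learning rate on a log scale
    batch_size = int(rng.choice([16, 32, 64, 128]))
    score = train_and_score(lr, batch_size)
    if best is None or score > best[0]:
        best = (score, lr, batch_size)
print(best)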
6. Advanced Techniques
• Batch Normalization: Normalizes the inputs of each layer, improving convergence.
• Gradient Clipping: Prevents exploding gradients by capping them at a certain threshold.
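Minimal sketches of both techniques; the clipping threshold and the toy inputs are assumptions for the example:

import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    # Rescale the gradient if its L2 norm exceeds max_norm
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize a mini-batch of activations to zero mean and unit variance, then scale and shift
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

print(clip_by_norm(np.array([3.0, 4.0])))                   # norm 5 is rescaled to norm 1
print(batch_norm(np.array([[1.0, 2.0], [3.0, 4.0]])))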
7. Distributed and Parallel Optimization
Techniques to scale training across multiple GPUs or machines can lead to faster training times and the ability to work with larger datasets.
8. Neural Architecture Search (NAS)
An automated process for optimizing the architecture of the neural network itself, using techniques like reinforcement learning or evolutionary algorithms.
Challenges in Neural Network Optimization
Some common challenges:
1. Vanishing and Exploding Gradients
• Vanishing Gradients: In deep networks, gradients can become very
small, making it difficult for the model to learn. This is particularly
problematic in long sequences or deep architectures.
• Exploding Gradients: Conversely, gradients can grow exponentially,
causing numerical instability and leading to divergent training.
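A tiny numerical illustration of both effects, using the sigmoid's maximum derivative (0.25) and an assumed per-layer weight factor of 1.5 over a 50-layer chain:

# Backpropagation multiplies one derivative factor per layer.
sigmoid_grad_max = 0.25          # largest possible derivative of the sigmoid
print(sigmoid_grad_max ** 50)    # ~7.9e-31: the gradient has effectively vanished

weight_factor = 1.5              # assumed per-layer factor greater than 1
print(weight_factor ** 50)       # ~6.4e8: the gradient explodes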
2. Overfitting
• Neural networks have a high capacity to memorize training data, which can lead to overfitting, where the model performs well on training data but poorly on unseen data.
3. Choosing the Right Architecture
• Selecting the optimal network architecture (number of layers,
types of layers, etc.) is often trial-and-error and can significantly
impact performance.
4. Hyperparameter Tuning
• Finding the best hyperparameters (learning rate, batch size, regularization strength) can be time-consuming and requires extensive experimentation.
5. Local Minima and Saddle Points
• Optimization landscapes can be complex, with many local minima and saddle points. Finding a global minimum can be challenging.
6. Computational Resources
• Training deep networks can require significant computational resources and time, especially for large datasets or complex models.
7. Data Quality and Quantity
• Insufficient or poor-quality data can hinder training. Imbalanced datasets can lead to biased models.
8. Non-convexity
• The optimization problem in neural networks is non-convex, making it difficult to guarantee convergence to the global minimum.
9. Sensitivity to Initialization
• Poor weight initialization can lead to suboptimal training, causing slow convergence or failure to converge.
10. Batch Size Effects
• The choice of batch size can influence training dynamics, generalization, and convergence speed. Small batches can lead to noisy gradient estimates, while large batches may require careful tuning of the learning rate.
11. Generalization Across Domains
• Models trained in one domain may not generalize well to others, raising issues with transfer learning and domain adaptation.
12. Interpretability
• Understanding why a neural network makes specific predictions is challenging, making it difficult to debug or improve models.
Parameter Initialization
• Initializing the parameters of a deep neural network is an important step
in the training process, as it can have a significant impact on the
convergence and performance of the model. Here are some common
parameter initialization techniques used in deep learning:
• Zero Initialization: Initialize all the weights and biases to zero. This is not generally used in deep learning, as it leads to symmetry in the gradients, resulting in all the neurons learning the same feature.
• Random Initialization: Initialize the weights and biases randomly from a uniform or normal distribution. This is the most common technique used in deep learning.
• Xavier Initialization: Initialize the weights with a normal distribution with mean 0 and standard deviation sqrt(1/n), where n is the number of neurons in the previous layer. This is used for the sigmoid activation function.
• He Initialization: Initialize the weights with a normal distribution with mean 0 and standard deviation sqrt(2/n), where n is the number of neurons in the previous layer. This is used for the ReLU activation function.
• Orthogonal Initialization: Initialize the weights with an orthogonal matrix, which preserves the gradient norm during backpropagation.
• Uniform Initialization: Initialize the weights with a uniform distribution. This is less commonly used than random initialization.
• Constant Initialization: Initialize the weights and biases with a constant value. This is rarely used in deep learning.
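• A sketch of a few of these schemes in NumPy, assuming fully connected layers with the layer sizes shown:

import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # Xavier/Glorot: zero-mean normal with standard deviation sqrt(1 / n_in)
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

def he_init(n_in, n_out):
    # He: zero-mean normal with standard deviation sqrt(2 / n_in), suited to ReLU layers
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

def orthogonal_init(n_in, n_out):
    # Orthogonal: QR decomposition of a random matrix gives orthonormal columns
    a = rng.normal(size=(n_in, n_out))
    q, _ = np.linalg.qr(a)
    return q

W1 = xavier_init(256, 128)
W2 = he_init(128, 64)
print(W1.std(), W2.std())    # roughly sqrt(1/256) and sqrt(2/128)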
• Adaptive Learning Rate Method
• Adaptive learning rate methods are optimizations of gradient descent methods that aim to minimize the objective function of a network by using the gradient of the function and the parameters of the network, while adapting the learning rate during training.
• Gradient descent
• Before adaptive learning rate methods were introduced, gradient descent algorithms, including Batch Gradient Descent (BGD), Stochastic Gradient Descent (SGD), and mini-Batch Gradient Descent (mini-BGD, a mixture of BGD and SGD), were state of the art. In essence, these methods try to update the weights θ of the network with the help of a learning rate η, the objective function J(θ), and its gradient ∇_θ J(θ). What all gradient descent algorithms and their improvements have in common is the goal of minimizing J(θ) in order to find the optimal weights θ.
• θ = θ − η · ∇_θ J(θ)
• BGD tries to reach the minimum of J(θ) by subtracting the gradient of J(θ) from θ (refer to Figure 1 for a visualization). The algorithm always computes over the whole set of data for each update. This makes BGD the slowest variant and leaves it unable to update online. Additionally, it performs redundant operations for big data sets, computing similar examples at each update, and it converges to the closest minimum depending on the given data, resulting in potentially suboptimal results.
• θ = θ − η · ∇_θ J(θ; x^{(i)}, y^{(i)})
• Figure 1: (5) Local minima may occur in J(θ) (here J(w)), which may result in suboptimal solutions for some gradient descent methods.
• Contrary to BGD, SGD updates for each training example (x^{(i)}, y^{(i)}), performing one update step per example. This fluctuation enables SGD to jump to minima farther away, potentially reaching a better minimum, but it also makes SGD prone to overshooting. This can be counteracted by slowly decreasing the learning rate. In the exemplary code shown in Figure 2, a shuffle function is additionally used in the SGD and mini-BGD algorithms, compared to BGD. This is done because it is often preferable to avoid a meaningful order of the data and thereby avoid biasing the optimization algorithm, although better results can sometimes be achieved with the data in order; in that case the shuffle operation should be removed.
• Lastly, there is mini-BGD.
• θ = θ − η · ∇_θ J(θ; x^{(i:i+n)}, y^{(i:i+n)})
• Mini-BGD updates for every mini-batch of n training examples. This leads to a more stable convergence by reducing the variance of the parameter updates. When people talk about an SGD algorithm, they often refer to this version.
# Batch Gradient Descent (BGD): one update per epoch over the full data set
for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad

# Stochastic Gradient Descent (SGD): one update per training example
for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad

# Mini-Batch Gradient Descent (mini-BGD): one update per mini-batch of 50 examples
for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad

• Figure 2: (1) Pseudocode of the three gradient descent algorithms
• Adaptive Learning Rate Method
• As an improvement to traditional gradient descent algorithms, the
adaptive gradient descent optimization algorithms or adaptive
learning rate methods can be utilized. Several versions of these
algorithms are described below.
• Momentum can be seen as an evolution of SGD.
• v_t = γ · v_{t−1} + η · ∇_θ J(θ)
• θ = θ − v_t
• While SGD struggles when the loss surface curves much more steeply in one direction than in another, Momentum circumvents this by adding the update vector of the previous time step, scaled by a factor γ, usually around 0.9 (1). As an analogy, one can think of a ball rolling down the gradient, gathering momentum (hence the name), while still being slowed by air resistance (0 < γ < 1).
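• A sketch of the momentum update on a toy quadratic objective J(θ) = θ², with assumed values η = 0.01 and γ = 0.9:

import numpy as np

def momentum_step(theta, v, grad, lr=0.01, gamma=0.9):
    # v_t = gamma * v_{t-1} + lr * grad;  theta = theta - v_t
    v = gamma * v + lr * grad
    theta = theta - v
    return theta, v

theta, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    grad = 2 * theta             # gradient of theta^2
    theta, v = momentum_step(theta, v, grad)
print(theta)                      # converges toward 0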
• Nesterov accelerated gradient can be seen as a further enhancement to
momentum.
• v_t = γ · v_{t−1} + η · ∇_θ J(θ − γ · v_{t−1})
• θ = θ − v_t
• This algorithm adds a guess of the next step in the form of the look-ahead term γ · v_{t−1}. A comparison of the first two steps of Momentum and Nesterov accelerated gradient can be found in Figure 3. The additional term takes the error of the previous step into account, accelerating progress in comparison to plain momentum.
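• The same toy setup with the Nesterov look-ahead gradient (the objective and step sizes are again assumptions made only for the example):

import numpy as np

def nesterov_step(theta, v, grad_fn, lr=0.01, gamma=0.9):
    # Evaluate the gradient at the look-ahead point theta - gamma * v
    grad = grad_fn(theta - gamma * v)
    v = gamma * v + lr * grad
    theta = theta - v
    return theta, v

theta, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    theta, v = nesterov_step(theta, v, lambda t: 2 * t)   # gradient of theta^2
print(theta)                                               # converges toward 0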
• Contrary to Nesterov accelerated gradient, Adagrad adapts its learning rate η during its run-time and updates each parameter θ_i separately at each time step t. It has to do so because η is adapted for every θ_i individually.
• θ_{t+1,i} = θ_{t,i} − (η / √(G_{t,ii} + ε)) · ∇_θ J(θ_i)
• G_t is a diagonal matrix whose entries G_{t,ii} contain the sum of the squared past gradients with respect to each θ_i.
• ε is a smoothing term used to avoid division by zero and is generally insignificantly small (≈ 10⁻⁸).
• Due to the accumulation of the squared gradients in G_t, the effective learning rate η / √(G_{t,ii} + ε) becomes smaller over time, eventually shrinking so much that the algorithm acquires no new knowledge.
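• A sketch of the Adagrad update on the same toy quadratic objective; the learning rate η = 0.5 and the starting point are assumptions for the example:

import numpy as np

def adagrad_step(theta, grad, G, lr=0.5, eps=1e-8):
    # Per-parameter rate lr / sqrt(G + eps), where G accumulates squared gradients
    G = G + grad ** 2
    theta = theta - lr * grad / np.sqrt(G + eps)
    return theta, G

theta, G = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(500):
    grad = 2 * theta             # gradient of sum(theta^2)
    theta, G = adagrad_step(theta, grad, G)
print(theta)                      # each coordinate shrinks toward 0 at its own adapted rate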