
DEEP LEARNING

CHAPTER ONE
2

AGENDA
Introduction
Artificial Neural networks
Applications
INTRODUCTION
4

DEEP LEARNING
• Deep learning is a subset of machine learning that utilizes neural networks
with multiple layers to extract high-level features from data. It mimics the way
the human brain works by processing data through layers of interconnected
nodes (neurons).
• In a neural network, each neuron receives input signals, processes them, and
produces an output signal. The connections between neurons have associated
weights that determine the strength of influence one neuron has on another.
• What is an artificial neural network?
ANN
Artificial Neural networks
6

ARTIFICIAL NEURAL NETWORKS
Computational models inspired by the human brain:
• Algorithms that try to mimic the brain.
• Massively parallel, distributed systems made up of simple processing units (neurons).
• Synaptic connection strengths among neurons are used to store the acquired knowledge.
• Knowledge is acquired by the network from its environment through a learning process.
HISTORY
7
late-1800's - Neural Networks appear as an analogy to biological systems
1960's and 70's – Simple neural networks appear
• Fall out of favor because the perceptron is not effective by itself, and
there were no good algorithms for multilayer nets
1986 – Backpropagation algorithm appears
• Neural Networks have a resurgence in popularity
• More computationally expensive
EXAMPLE OF ANN APPLICATION
8
Churn prediction: predict whether the client will continue using the company's service or not.
Input: Customer data.
Output: The client is churned or not.

Customer data → [Black Box] → Churn?
EXAMPLE OF ANN APPLICATION
9

Customer data Balck Churn?


Box

Fraud detection Balck Fraud


Box
EXAMPLE OF ANN APPLICATION
Machine Learning (ML) vs ANN: 10

Traditional ML:
1. Model Complexity: Often uses simpler models such as logistic regression, decision trees, or random forests.
2. Feature Engineering: Requires manual feature selection and engineering based on domain knowledge and data analysis.

Artificial Neural Networks (ANN):
3. Model Complexity: Can handle complex relationships in data thanks to multiple layers and non-linear activation functions.
4. Feature Engineering: Can automatically learn relevant features from raw data, potentially reducing the need for extensive feature engineering.
WHEN TO CONSIDER NEURAL NETWORKS 11

1. Input is high-dimensional discrete or real-valued
2. Output is discrete or real-valued
3. Output is a vector of values
4. Possibly noisy data
5. Form of target function is unknown
6. Human readability of the result is not important
A NEURON
BIOLOGICAL NETWORKS
13
1. The majority of neurons encode their outputs
or activations as a series of brief electrical
pulses (i.e. spikes or action potentials).
2. Dendrites are the receptive zones that receive
activation from other neurons.
3. The cell body (soma) of the neuron’s
processes the incoming activations and converts
them into output activations.
4. Axons are transmission lines that send
activation to other neurons.
5. Synapses allow weighted transmission of
signals (using neurotransmitters) between
axons and dendrites to build up large neural
networks.
Networks of McCulloch-Pitts Neurons
14

Artificial neurons have the same basic components as biological neurons.

The simplest ANNs consist of a set of McCulloch-Pitts neurons labelled by indices k, i, j, and activation flows between them via synapses with strengths wki, wij.
A NEURON (= A PERCEPTRON) 15

[Diagram: inputs x0, x1, ..., xn with weights w0, w1, ..., wn feed a summation unit Σ with threshold t, followed by a nonlinear function f that produces output y. Input vector x, weight vector w, weighted sum.]

The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping: y = f(w·x − t).
A NEURON (= A PERCEPTRON) 16

Inputs Xi: 10, 30, 20
Sum / 3 = 20

The diagram presents the arithmetic average.


A NEURON (= A PERCEPTRON) 17

Inputs Xi: 10, 30, 20; weights Wi: 1/3, 1/3, 1/3
Sum = (10 × 1/3) + (30 × 1/3) + (20 × 1/3) = 20

The diagram presents the weighted average.


A NEURON (= A PERCEPTRON) 18

Inputs Xi: 10, 30, 20; weights Wi: 0.5, 0.2, 0.1

X = [10 30 20], W = [0.5 0.2 0.1], Y = X^T * W = (10 × 0.5) + (30 × 0.2) + (20 × 0.1) = 13

The diagram presents the linear relationship (weighted sum).
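A minimal NumPy sketch (added for illustration, not part of the slides) that reproduces the computations above; the input and weight values come from the slides, the code itself is an assumption:

```python
import numpy as np

x = np.array([10.0, 30.0, 20.0])        # input vector Xi

# Slides 16-17: arithmetic / weighted average with equal weights 1/3
w_avg = np.array([1/3, 1/3, 1/3])
print(x @ w_avg)                        # 20.0

# Slide 18: general weighted sum Y = X^T * W
w_linear = np.array([0.5, 0.2, 0.1])
print(x @ w_linear)                     # 13.0
```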


ANN FLOW 19

1. Input: it is very important that each data point has a label (explain with the cat example).
• Tabular data
• Text data
• Image data
• Audio data

2. Hidden layer:
The input and output pairs can have complex relationships, and to decode these, hidden layers exist between the input and the output layers.

3. Output:
The output layer of the neural network presents the final predictions generated by the network.
Regression: one output / classification: the number of outputs depends on the number of classes.
DIFFERENT TYPES OF CONNECTIONS BETWEEN NEURONS 20

1. Feedforward connections: These connections propagate information in one direction, from input neurons through hidden layers to output neurons, without any feedback loops. Feedforward connections are characteristic of standard multilayer perceptron (MLP) architectures.

2. Recurrent connections: Unlike feedforward connections, recurrent connections allow information to loop back to previous layers or even the same layer. This type of connection enables memory and temporal dynamics in the network, making it suitable for tasks such as sequential data processing, time series analysis, and natural language processing.

Question: are there still two other types of connections?


ANN MODELS
BASICS:
21

W: represents the parameters or weights that the model learns during training. These weights are adjusted iteratively during the training process to minimize the error between the predicted output and the actual output.

It is not feasible to try an infinite number of values of W to find the optimal solution and the minimum of the loss function. Instead, gradient descent or other optimization algorithms iteratively adjust the weights based on the gradients of the loss function with respect to the weights, moving them in the direction of lower loss.

Training data: Y = W*X + b, e.g. Y = 0.5*X + 3
Blue: active clients. Orange: churned clients.
After training on the data we get our model, the line in the graphic, and we are able to tell the difference between active clients and churned clients.

Test: we have data for a new client that did not exist in the training data, and we predict the new result:
X = 6
Y = 0.5*(6) + 3 = 6
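A tiny sketch (my illustration, not from the slides) of evaluating the trained linear model Y = W*X + b on the new client; the function name is an assumption:

```python
# Trained parameters from the slide: W = 0.5, b = 3
W, b = 0.5, 3.0

def predict(x):
    return W * x + b

print(predict(6))   # 6.0, matching the slide's example for the new client
```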
PERCEPTRON
22

[Diagram: input vector x → weight vector w → weighted sum → activation function → output y]
1. INPUT LAYER (diagram presents the arithmetic average) 23

Input vector Xi = [10, 30, 20]: X1 = 10, X2 = 30, X3 = 20
Sum / 3 = 20
EXAMPLE:
1. We want to predict whether the patient is diabetic or not. 24
2. We have the following variables: diagnoses, procedures, medications from that patient's medical record (input "X").
3. We want to use "X" to predict "y", i.e. use the medical record to predict whether the patient is diabetic or not. You also want many training examples.

X patient (not diabetic): diagnoses = 10, procedures = 30, medications = 20
(input layer → hidden layer → output layer)

1. Input vector x: Xi = [10, 30, 20]
2. Weight vector w: wi = [0, 0, 0]
3. Weighted sum: Y = X^T * W …………(1)
   Y = (x1*w1) + (x2*w2) + (x3*w3) = (10*0) + (30*0) + (20*0) = 0
   Y = X^T * W + b …………(2), with bias b = 5: Y = 0 + 5 = 5
4. Activation function: 1 if X^T*W + b > 0, 0 otherwise
5. Output: Y = 1 → error of prediction
X patient (not diabetic): diagnoses = 10, procedures = 30, medications = 20  25
(input layer → hidden layer → output layer, with weights Wi, bias b, and an activation function giving 1 or 0)

How can we estimate the Wi and minimize the loss function?

Loss function (error): Y real − Y predicted = 0 − 1

Backpropagation: is an iterative optimization algorithm that trains a neural network by adjusting the weights of the connections in the network in order to minimize the difference between the predicted output and the actual output for a given set of training examples.
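A minimal Python sketch of the forward pass in slides 24-25; the values X = [10, 30, 20], W = [0, 0, 0] and b = 5 come from the slides, the code itself is an illustration rather than the original material:

```python
import numpy as np

x = np.array([10.0, 30.0, 20.0])   # diagnoses, procedures, medications
w = np.zeros(3)                    # initial weights, all zero as on the slide
b = 5.0                            # bias

z = x @ w + b                      # weighted sum: 0 + 5 = 5
y_pred = 1 if z > 0 else 0         # step activation -> 1

y_true = 0                         # the patient is not diabetic
error = y_true - y_pred            # 0 - 1 = -1 -> prediction error
print(z, y_pred, error)
```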
ACTIVATION FUNCTION
26
Definition: An activation function is a mathematical function applied to the weighted
sum of the inputs of a neuron, optionally with a bias added, to determine the neuron's
output.

Objectives of activation functions:

Introduce non-linearity: One of the primary objectives of activation functions is to introduce non-linearity to the neural network. Without non-linear activation functions, no matter how many layers are added, the entire network would behave like a single-layer perceptron, unable to learn complex patterns in the data.

Activation functions have different properties and are suited for different types of neural network architectures and tasks. The choice of an activation function depends on factors such as the specific problem you are trying to solve, the nature of the data, and the architecture of your neural network.
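A minimal NumPy sketch (added for illustration, not from the slides) of some commonly used activation functions; the specific list shown here is an assumption:

```python
import numpy as np

def step(z):               # perceptron-style threshold
    return np.where(z > 0, 1.0, 0.0)

def sigmoid(z):            # squashes z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):               # max(0, z), common in hidden layers
    return np.maximum(0.0, z)

def tanh(z):               # squashes z into (-1, 1)
    return np.tanh(z)

z = np.array([-2.0, 0.0, 0.496])
print(step(z), sigmoid(z), relu(z), tanh(z))
```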
ACTIVATION FUNCTION
28
BIAS
29

What is bias in a neural network?

In simple words, neural network bias can be defined as the constant which is added to the product of features and weights. It is used to offset the result. It helps the model shift the activation function towards the positive or negative side.
LOSS FUNCTION
A loss function, also known as a cost function or objective function, measures how well a machine learning model performs on a dataset by comparing its predictions to the actual target values. The goal during training is to minimize this loss function. 30

In the context of neural networks, the choice of a loss function depends on the type of problem being solved (e.g., classification, regression) and the desired behavior of the model. Here are some commonly used loss functions:
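A minimal NumPy sketch (added for illustration) of two loss functions used later in this chapter, mean squared error and binary cross-entropy; note that np.log is the natural logarithm, whereas the numeric loss values quoted on the later slides correspond to log base 10:

```python
import numpy as np

def mse(y_true, y_pred):
    # mean squared error, typically used for regression
    return 0.5 * np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # binary cross-entropy, used for binary classification;
    # eps avoids log(0) when a prediction saturates at 0 or 1
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(mse(np.array([1.0]), np.array([0.496])))                   # 0.127
print(binary_cross_entropy(np.array([1.0]), np.array([0.496])))  # ~0.70 (natural log)
```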
BACKPROPAGATION
Definition: Backpropagation is an iterative optimization algorithm used to train neural networks by 31
computing the gradients of the loss function with respect to the weights of the network, and then using
these gradients to update the weights in a direction that reduces the loss.
Explanation:
Forward Pass: During the forward pass, input data is fed into the neural network, and the network computes
its output by propagating the input forward through the network layers using the current weights. The
predicted output is compared to the actual output, and the error (the difference between predicted and actual
output) is computed.

Backward Pass: During the backward pass, the error is propagated backward through the network to calculate
the gradients of the loss function with respect to each weight in the network. This is done using the chain rule
of calculus, which allows the error at the output layer to be decomposed into contributions from each weight in
the network.
Weight Update: Once the gradients of the loss function with respect to each weight are computed, the
weights are updated in the direction that reduces the loss. This is typically done using an optimization
algorithm such as gradient descent, which updates the weights by subtracting a fraction of the gradient
from each weight.
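A minimal sketch (my own illustration, not the slides' code) of one forward pass, backward pass, and weight update for a single sigmoid neuron trained with binary cross-entropy; all variable names and the learning rate are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# tiny example: one neuron, one training example
x, y_true = np.array([0.6]), 1.0
w, b, lr = np.array([0.5]), 0.1, 0.1       # initial weight, bias, learning rate

for _ in range(3):
    # forward pass
    z = x @ w + b
    a = sigmoid(z)
    loss = -(y_true * np.log(a) + (1 - y_true) * np.log(1 - a))

    # backward pass: for sigmoid + cross-entropy, dL/dz simplifies to (a - y)
    dz = a - y_true
    dw = dz * x
    db = dz

    # weight update (gradient descent step)
    w -= lr * dw
    b -= lr * db
    print(round(float(loss), 4))           # loss decreases at each iteration
```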
EXAMPLE
1. ANN structure 32
Input layer: One neuron
Hidden layer: Two neurons
Output layer: One neuron
X= 0.6
2. Weights connecting input to hidden layer
W 1,1 = 0.5
W1,2 = 0.3
3. Biases for hidden layer
b1 = 0.1

4. Weights connecting hidden to output layer

W3,1 = 0.1
W3,2 = 0.2
5. Biases for output layer
b2 = 0.4

6. Activation function: ReLU

7. The binary cross-entropy loss function:
Loss function = −[y_target * log(a3) + (1 − y_target) * log(1 − a3)]
EXAMPLE
33
Input layer: One neuron
Hidden layer: Two neurons
Output layer: One neuron

Input layer → Hidden layer → Output layer
X = 0.6; W1,1 = 0.5, W1,2 = 0.3, b1 = 0.1; a1 = 0.4, a2 = 0.28; w3,1 = 0.1, w3,2 = 0.2, b2 = 0.4

Forward Pass
Weighted sums (hidden layer):
Z1 = (0.6 × 0.5) + 0.1 = 0.4
Z2 = (0.6 × 0.3) + 0.1 = 0.28
Activation function (ReLU):
a1 = Max(0, 0.4) = 0.4
a2 = Max(0, 0.28) = 0.28
Weighted sum (output layer):
Z3 = (0.4 × 0.1) + (0.28 × 0.2) + 0.4 = 0.496
Activation function:
a3 = Max(0, 0.496) = 0.496
Loss function = −[y_target × log(a3) + (1 − y_target) × log(1 − a3)]
Loss = −[1 × log(0.496) + (1 − 1) × log(1 − 0.496)] = 0.30
34

Output activation variant: a3 = 0.62 or a3 = 0.59 (these values correspond to applying a sigmoid to Z3 = 0.496 and Z3 = 0.384 respectively)

Loss function = −[y_target × log(a3) + (1 − y_target) × log(1 − a3)]

Loss = −[1 × log(0.62)] = 0.2
Loss = −[1 × log(0.59)] = 0.23

With the squared-error loss instead:
Loss = 1/2 × (1 − 0.62)^2 = 0.072
EXAMPLE
35
Input layer: One neuron
Hidden layer: Two neurons
Output layer: One neuron

Input layer → Hidden layer → Output layer
X = 0.6; W1,1 = 0.5, W1,2 = 0.3, b1 = 0.1; a1 = 0.4, a2 = 0.28; w3,1 = 0.1, w3,2 = 0.2, b2 = 0.4

Forward Pass
Weighted sums (hidden layer):
Z1 = (0.6 × 0.5) + 0.1 = 0.4
Z2 = (0.6 × 0.3) + 0.1 = 0.28
Activation function (ReLU):
a1 = Max(0, 0.4) = 0.4
a2 = Max(0, 0.28) = 0.28
Weighted sum (output layer):
Z3 = (0.4 × 0.1) + (0.28 × 0.2) + 0.4 = 0.496
Activation function:
a3 = Max(0, 0.496) = 0.496
Loss function = 1/2 × (Y_target − Y_predict)^2
Loss = 1/2 × (1 − 0.496)^2 = 0.127
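A short NumPy sketch (added for illustration) reproducing the forward pass and both loss values from slides 33-35; the slides' logarithmic loss value (0.30) matches log base 10, so the sketch uses np.log10 for comparison:

```python
import numpy as np

x = 0.6
w1, b1 = np.array([0.5, 0.3]), 0.1        # input -> hidden
w3, b2 = np.array([0.1, 0.2]), 0.4        # hidden -> output

z_hidden = x * w1 + b1                    # [0.4, 0.28]
a_hidden = np.maximum(0, z_hidden)        # ReLU
z3 = a_hidden @ w3 + b2                   # 0.496
a3 = max(0.0, z3)                         # ReLU output: 0.496

y = 1.0
bce = -(y * np.log10(a3) + (1 - y) * np.log10(1 - a3))   # ~0.30 (log base 10)
mse = 0.5 * (y - a3) ** 2                                 # ~0.127
print(z_hidden, z3, round(bce, 3), round(mse, 3))
```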
EXAMPLE
Business context / Problem  36
1. We want to predict whether the patient is diabetic or not.
2. We have the following variables: diagnoses, procedures, medications, lab from that patient's medical record (input "X").
3. We want to use "X" to predict "y", i.e. use the medical record to predict whether the patient is diabetic or not. You also want many training examples.

Data input
Diabetic: Yes (1)
Diagnoses: 0.5
Procedures: 2.8
Medications: 0
Lab: -0.1

1. ANN structure
Input layer: 04 neurons
Hidden layer: Three neurons
Output layer: One neuron

2. Weights connecting input to hidden layer
W1,1 W1,2 W1,3 = 3 / -1.5 / 2
W2,1 W2,2 W2,3 = 1 / -3 / 1
W3,1 W3,2 W3,3 = 5 / -7.1 / -6
W4,1 W4,2 W4,3 = -4 / 5.2 / 2.9

3. Biases for hidden layer
b1 = -2

4. Weights connecting hidden to output layer
W5,1 W5,2 W5,3 = -2 / 5 / 1

5. Biases for output layer
b2 = 0.6

6. Activation function: ReLU / Sigmoid

7. The binary cross-entropy loss function:
Loss function = −[y_target * log(a3) + (1 − y_target) * log(1 − a3)]
EXAMPLE :
37
1. ANN structure

Number of weights Wi
Total 1 = nb input neurons × nb hidden neurons = 4 × 3 = 12
Total 2 = nb hidden neurons × nb output neurons = 3 × 1 = 3
Total = 12 + 3 = 15
EXAMPLE :
38
2. FIRST LAYER (weighted sum)

Input to the first hidden node E (WE):
WE = (0.5 × 3) + (2.8 × 1) + (0 × 5) + (−0.1 × −4) − 2 = 2.7

Input to the second hidden node F (WF):
WF = (0.5 × −1.5) + (2.8 × −3) + (0 × −7.1) + (−0.1 × 5.2) − 2 = −11.67

Input to the third hidden node G (WG):
WG = (0.5 × 2) + (2.8 × 1) + (0 × −6) + (−0.1 × 2.9) − 2 = 1.51
EXAMPLE :
39
2. FIRST LAYER (weighted sum)

Input to the second hidden node F (WF):
WF = (0.5 × −1.5) + (2.8 × −3) + (0 × −7.1) + (−0.1 × 5.2) − 2 = −11.67
EXAMPLE :
40
2. FIRST LAYER (weighted sum)

Input to the third hidden node G (WG):
WG = (0.5 × 2) + (2.8 × 1) + (0 × −6) + (−0.1 × 2.9) − 2 = 1.51
EXAMPLE :
41
2. FIRST LAYER (weighted sum)
EXAMPLE :
42
Activation function
EXAMPLE : HIDDEN LAYER
43
44

Output node weighted sum, using the sigmoid hidden activations (0.937, 0.000009, 0.819):
Z = (0.937 × −2) + (0.000009 × 5) + (0.819 × 1) + 0.6 = −0.455
Output node = sigmoid(−0.455) = 0.388

Output node weighted sum, using the ReLU hidden activations (2.7, 0, 1.51):
Z = (2.7 × −2) + (0 × 5) + (1.51 × 1) + 0.6 = −3.29
Output node = sigmoid(−3.29) = 0.036

Question: can we use another activation function?
45

7. The binary cross-entropy loss function, with output 0.388 (sigmoid hidden layer) or 0.036 (ReLU hidden layer)

Loss function = −[y_target * log(a3) + (1 − y_target) * log(1 − a3)]

Y real = 1:
Loss (a3 = 0.388) = 0.41
Loss (a3 = 0.036) = 1.44

Y real = 0:
Loss (a3 = 0.388) = 0.21
Loss (a3 = 0.036) = 0.016
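A NumPy sketch (added for illustration, not the slides' code) reproducing slides 36-45 end to end; as before, the slides' loss numbers correspond to log base 10:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 2.8, 0.0, -0.1])          # diagnoses, procedures, medications, lab
W1 = np.array([[3, -1.5, 2],
               [1, -3, 1],
               [5, -7.1, -6],
               [-4, 5.2, 2.9]], dtype=float)  # input -> hidden (4x3)
b1 = -2.0
W2 = np.array([-2.0, 5.0, 1.0])               # hidden -> output
b2 = 0.6

z_hidden = x @ W1 + b1                        # [2.7, -11.67, 1.51]
for a_hidden, name in [(sigmoid(z_hidden), "sigmoid"), (np.maximum(0, z_hidden), "relu")]:
    a_out = sigmoid(a_hidden @ W2 + b2)       # 0.388 (sigmoid) or 0.036 (relu)
    loss_y1 = -np.log10(a_out)                # Y real = 1
    loss_y0 = -np.log10(1 - a_out)            # Y real = 0
    print(name, round(float(a_out), 3), round(float(loss_y1), 2), round(float(loss_y0), 3))
```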



EXERCISE
Data input  46
Diabetic: Yes (1)
Diagnoses: 0.5
Procedures: 2.8
Medications: 1
Lab: -0.1

1. ANN structure
Input layer: 04 neurons
First hidden layer: Three neurons
Second hidden layer: Three neurons
Output layer: One neuron

2. Weights connecting input to first hidden layer
W1,1 W1,2 W1,3 = 3 / -1.5 / 2
W2,1 W2,2 W2,3 = 1 / -3 / 1
W3,1 W3,2 W3,3 = 5 / -7.1 / -6
W4,1 W4,2 W4,3 = -4 / 5.2 / 2.9

3. Biases for hidden layers
b1 = -2

4. Weights connecting first hidden layer to second hidden layer
W5,1 W5,2 W5,3 = 3.5 / 0.5 / 4
W6,1 W6,2 W6,3 = 1 / -3.5 / 2
W7,1 W7,2 W7,3 = 3 / -2.1 / -1

4. Weights connecting hidden to output layer
W8,1 W8,2 W8,3 = 3.5 / -4.5 / 1

5. Biases for output layer
b2 = 0.6

6. Activation function methods
6.1. ReLU / ReLU / Sigmoid
6.2. ReLU / Sigmoid / Sigmoid
6.3. Sigmoid / Sigmoid / Sigmoid

7. The binary cross-entropy loss function
Calculate the 03 loss functions according to each activation function method.
Solution: 47
Solution: 48

To calculate the net input for each neuron in the first hidden layer, we use the formula: z_j = Σ_i (x_i × W_i,j) + b1

Let's calculate the net inputs for the first hidden layer:

Weighted sums:
z1 = (3×0.5)+(1×2.8)+(5×1)+(−4×−0.1)−2 = 1.5+2.8+5+0.4−2 = 7.7
z2 = (−1.5×0.5)+(−3×2.8)+(−7.1×1)+(5.2×−0.1)−2 = −0.75−8.4−7.1−0.52−2 = −18.77
z3 = (2×0.5)+(1×2.8)+(−6×1)+(2.9×−0.1)−2 = 1+2.8−6−0.29−2 = −4.49

Activation function (ReLU):
a1 = Max(0, 7.7) = 7.7
a2 = Max(0, −18.77) = 0
a3 = Max(0, −4.49) = 0
Solution: 49

Step 3: First Hidden Layer to Second Hidden Layer

Weighted sums:
z4 = (3.5×7.7)+(1×0)+(3×0)−2 = 26.95−2 = 24.95
z5 = (0.5×7.7)+(−3.5×0)+(−2.1×0)−2 = 3.85−2 = 1.85
z6 = (4×7.7)+(2×0)+(−1×0)−2 = 30.8−2 = 28.8

Activation function (ReLU):
a4 = Max(0, 24.95) = 24.95
a5 = Max(0, 1.85) = 1.85
a6 = Max(0, 28.8) = 28.8
Solution: 50

Step 5: Second Hidden Layer to Output Layer

Weighted sum:
z7 = (3.5×24.95)+(−4.5×1.85)+(1×28.8)+0.6 = 108.4

Step 6: Activation - Sigmoid
Applying the sigmoid activation function:
a7 = sigmoid(108.4) ≈ 1

Step 7: Calculate Loss

Since Y = 1, we can use the binary cross-entropy loss function:
L = −log(a7) = 0

New case: what if Y = 0?
Solution: 52

Step 7: Calculate Loss

Given Y = 0 and the output a7 ≈ 1, we can use the binary cross-entropy loss function:
L = −(Y log(a7) + (1 − Y) log(1 − a7))

Since Y = 0, the loss simplifies to:
L = −log(1 − a7)

Plugging in the value of a7, we get:
L = −log(1 − 1) = −log(0)

This means the loss diverges to infinity, indicating a large error in the prediction.
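A sketch (added; not the slides' code) that runs the exercise network forward with the ReLU / ReLU / Sigmoid activation method and evaluates the binary cross-entropy for both Y = 1 and Y = 0; a small epsilon keeps −log(0) finite, and the variable names are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 2.8, 1.0, -0.1])
W1 = np.array([[3, -1.5, 2], [1, -3, 1], [5, -7.1, -6], [-4, 5.2, 2.9]], dtype=float)
W2 = np.array([[3.5, 0.5, 4], [1, -3.5, 2], [3, -2.1, -1]], dtype=float)
W3 = np.array([3.5, -4.5, 1.0])
b_hidden = -2.0                        # shared hidden-layer bias b1
b_out = 0.6                            # output bias

h1 = np.maximum(0, x @ W1 + b_hidden)  # [7.7, 0, 0]
h2 = np.maximum(0, h1 @ W2 + b_hidden) # [24.95, 1.85, 28.8]
a7 = sigmoid(h2 @ W3 + b_out)          # ~1.0

eps = 1e-12
for y in (1.0, 0.0):
    loss = -(y * np.log(a7 + eps) + (1 - y) * np.log(1 - a7 + eps))
    print(y, float(a7), float(loss))   # loss ~0 for Y = 1, very large for Y = 0
```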
Solution: 53

Gradient Descent
To update the weights and biases using gradient descent, we need to calculate the derivatives of the loss function with respect to each weight and bias.

The general update rule for gradient descent is:
new weight = old weight − η × gradient

Where:
η (eta) is the learning rate.
gradient is the gradient of the loss function with respect to the weight.

We need to calculate the gradients for each weight and bias in the network.
Solution: 54

Calculating Gradients
For the output layer, the gradient of the binary cross-entropy loss with respect to the output activation a7 is given by:

∂L/∂a7 = −(Y/a7) + (1 − Y)/(1 − a7)

For the sigmoid activation function, the derivative is:

σ′(z) = σ(z) × (1 − σ(z))

So, by the chain rule, the gradient of the loss function with respect to the net input z7 is:

∂L/∂z7 = ∂L/∂a7 × σ′(z7) = a7 − Y
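A short sketch (my illustration) of the output-layer gradient ∂L/∂z7 = a7 − Y and one gradient-descent update of the weights W8 feeding the output node; the learning rate value is an assumption:

```python
import numpy as np

eta = 0.1                                   # learning rate (assumed value)
a_hidden2 = np.array([24.95, 1.85, 28.8])   # second-hidden-layer activations from the solution
W8 = np.array([3.5, -4.5, 1.0])             # hidden -> output weights
b_out = 0.6
y = 0.0                                     # the "new case" where Y = 0

a7 = 1.0 / (1.0 + np.exp(-(a_hidden2 @ W8 + b_out)))   # ~1.0

dz7 = a7 - y                                # gradient of BCE + sigmoid w.r.t. net input z7
grad_W8 = dz7 * a_hidden2                   # dL/dW8 = dz7 * activations feeding the node
grad_b = dz7

W8_new = W8 - eta * grad_W8                 # gradient-descent update
b_new = b_out - eta * grad_b
print(W8_new, b_new)
```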
HOW NEURAL NETWORKS LEARN
USING GRADIENT DESCENT
CONCEPT
56
DEFINITION
57

Gradient descent is a popular optimization approach for training machine learning models and neural networks. These models improve over time using training data, and the cost function within gradient descent acts as a barometer, gauging the model's accuracy with each iteration of parameter updates.

Optimization methods are used in many areas of study to find solutions that maximize or minimize some parameters of interest, such as minimizing costs in the production of a good or service, maximizing profits, minimizing raw material in the development of a good, or maximizing production.

The purpose of gradient descent, like finding the line of best fit in linear regression, is to minimize the cost function, i.e. the difference between predicted and actual values.
DEFINITION
58
1. Gradient descent finds the Wi (weight) values that minimize the error.
2. For example, you can imagine a person standing on top of a mountain trying to reach the bottom, or foot, of the mountain.
3. Gradient descent can help him find the best way down.

1. The initial weights can be chosen randomly.
2. Each descending step is taken by using the derivative of the loss function, which gives us the direction of the step.
3. New Wi values:
Wi (new) = Wi − learning_rate (eta) × ∂(loss function)/∂Wi
The negative sign is what lets us descend the mountain.

Learning rate?
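A minimal sketch (added for illustration; the quadratic loss and starting point are assumptions) of the update rule Wi(new) = Wi − η × ∂L/∂Wi applied to a one-parameter loss:

```python
# Minimal gradient descent on L(w) = (w - 3)**2, whose minimum is at w = 3.
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)      # derivative of the loss w.r.t. w

w = 10.0                        # initial weight, chosen arbitrarily
eta = 0.1                       # learning rate

for step in range(25):
    w = w - eta * grad(w)       # Wi(new) = Wi - eta * dL/dWi

print(round(w, 3), round(loss(w), 5))   # w approaches 3, loss approaches 0
```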


DEFINITION
59
LEARNING RATE
60
Learning rate (also referred to as step size or the alpha) — The size of the steps taken to reach the minimum is
known as step size. It is usually a small number and is adjusted depending on the behavior of the cost function.
Having a large learning rate will cause larger steps, but there is a chance that it could go past the minimum. On the
other hand, a low learning rate will have smaller steps and be more accurate, but it will take a lot more time and
calculations to reach the minimum.
LEARNING RATE
61

How can we know what the right learning rate is for our model?
Empirically: test multiple values between 0 and 1 and keep the one that gives the best accuracy (or RMSE).
CHALLENGES WITH GRADIENT DESCENT
62
Local minima and saddle points — Gradient descent can easily find the global minimum of convex problems, but it can struggle to find the global minimum of nonconvex problems, which is where the model would achieve the best results. Once the slope of the cost function approaches or reaches zero, the learning process stops. This can happen in multiple scenarios, such as local minima and saddle points. The former looks like a global minimum, with the slope increasing on either side of the current point. The latter is named after a horse's saddle, since the negative gradient only exists on one side of the point, with a local maximum on one side and a local minimum on the other. To escape such locations, it can be beneficial to introduce noise to the gradient.
MATH OF GRADIENT DESCENT
63
MATH OF GRADIENT DESCENT
64
MATH OF GRADIENT DESCENT
65
MATH OF GRADIENT DESCENT
66
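The slides under these headings were figures; as a hedged sketch of the kind of math they would cover, here is the standard gradient-descent derivation for a single weight w and bias b of a linear model y = w·x + b under the squared-error loss (symbols N, x_i, y_i, η follow the notation used earlier in this chapter):

```latex
\[
L(w, b) = \frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - (w x_i + b)\bigr)^2
\]
\[
\frac{\partial L}{\partial w} = -\frac{1}{N}\sum_{i=1}^{N} x_i\bigl(y_i - (w x_i + b)\bigr),
\qquad
\frac{\partial L}{\partial b} = -\frac{1}{N}\sum_{i=1}^{N} \bigl(y_i - (w x_i + b)\bigr)
\]
\[
w \leftarrow w - \eta \,\frac{\partial L}{\partial w},
\qquad
b \leftarrow b - \eta \,\frac{\partial L}{\partial b}
\]
```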
EPOCH, ITERATION, BATCH
67
One epoch is when the ENTIRE dataset is passed forward and backward through the neural network exactly ONCE.

Since one epoch is too big to feed to the computer at once, we divide it into several smaller batches.

So, what is the right number of epochs?


EPOCH, ITERATION, BATCH

68
Unfortunately, there is no right answer to this question. The answer is different for different datasets, but you can say that the number of epochs is related to how diverse your data is.

A batch is simply a subset of the training dataset. For example, if you have 1000 training examples and you set a batch size of 50, then each batch will contain 50 examples, and there will be 20 batches in one epoch.

Iteration: An iteration refers to the process of updating the model's parameters once based on the data in one batch. So, in one epoch, there are as many iterations as there are batches. For instance, if there are 20 batches in one epoch (using the previous example), there will be 20 iterations in that epoch.
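A tiny sketch (illustrative; the dataset size and batch size are taken from the example above) of how epochs, batches, and iterations relate in a training loop:

```python
num_examples = 1000
batch_size = 50
batches_per_epoch = num_examples // batch_size   # 20 batches per epoch
epochs = 3

iteration = 0
for epoch in range(epochs):
    for batch in range(batches_per_epoch):
        # one iteration = one parameter update on one batch
        iteration += 1
print(batches_per_epoch, iteration)               # 20 batches/epoch, 60 iterations total
```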
WHY MORE THAN ONE EPOCH
69

Critical point:
We are using gradient descent, which is an iterative process. So, updating the weights with a single pass, or one epoch, is not enough.

As the number of epochs increases, the weights are changed more times in the neural network, and the curve goes from underfitting to optimal to overfitting.
THANK
YOU
Mr Hocine Abdelouaheb
