Deep Learning Module V01
CHAPTER ONE
AGENDA
Introduction
Artificial Neural Networks
Applications
INTRODUCTION
DEEP LEARNING
• Deep learning is a subset of machine learning that utilizes neural networks
with multiple layers to extract high-level features from data. It mimics the way
the human brain works by processing data through layers of interconnected
nodes (neurons).
• In a neural network, each neuron receives input signals, processes them, and
produces an output signal. The connections between neurons have associated
weights that determine the strength of influence one neuron has on another.
• What is an artificial neural network?
ANN
Artificial Neural Networks
ARTIFICIAL NEURAL NETWORKS
Computational models inspired by the human brain:
• Algorithms that try to mimic the brain.
• Synaptic connection strengths among neurons are used to store the acquired
knowledge.
Customer data → [Black box] → Churn?
EXAMPLE OF ANN APPLICATION
Traditional machine learning:
1. Model complexity: often simpler models such as logistic regression, decision trees, or random forests are used.
2. Feature engineering: requires manual feature selection and engineering based on domain knowledge and data analysis.
Artificial Neural Networks (ANN):
3. Model complexity: can handle complex relationships in data thanks to multiple layers and non-linear activation functions.
4. Feature engineering: ANNs can automatically learn relevant features from raw data, potentially reducing the need for extensive feature engineering.
WHEN TO CONSIDER NEURAL NETWORKS
A NEURON (= A PERCEPTRON)
[Diagram: inputs x0, x1, …, xn with weights w0, w1, …, wn feed a weighted sum Σ with bias −t and a non-linear function f, which produces the output y.]
The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping:
y = f(Σ_i wi·xi − t)
Example: inputs Xi = (10, 30, 20).
With equal weights Wi = (1/3, 1/3, 1/3), the neuron computes Sgn(10 × 1/3 + 30 × 1/3 + 20 × 1/3) = Sgn(20), i.e. the weighted sum is the average of the inputs, 60/3 = 20.
With different weights (e.g. Wi = 0.5 for the input 10 and Wi = 0.1 for the input 20), each input contributes with a different strength to the output. A code sketch of this computation follows.
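A minimal Python sketch of this single-neuron computation (illustrative variable names; the sign function stands in for the activation f):

    import numpy as np

    def perceptron(x, w, t=0.0):
        """Weighted sum of inputs minus threshold, passed through a sign activation."""
        z = np.dot(w, x) - t          # scalar product w . x, shifted by the bias/threshold t
        return np.sign(z)             # non-linear mapping f

    x = np.array([10.0, 30.0, 20.0])  # example inputs
    w = np.array([1/3, 1/3, 1/3])     # equal weights
    print(perceptron(x, w))           # sign(20.0) = 1.0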
1. Input layer:
The input layer receives the input vector x (the raw features).
2. Hidden layer:
The input and the output pairs can have complex relationships, and to decode these, hidden layers exist between the input and the output layers.
3. Output layer:
The output layer of a neural network presents the final predictions generated by the network.
Regression: one output neuron; classification: the number of output neurons depends on the number of classes (see the sketch below).
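As a rough illustration (not from the slides), here is how such an input/hidden/output structure could be declared with the Keras Sequential API, assuming TensorFlow is available; the layer sizes roughly mirror the diabetic-prediction example later in the chapter (4 inputs, 3 hidden neurons, 1 output):

    import tensorflow as tf

    # Illustrative structure: 4 input features, one hidden layer, binary classification output
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),                    # input layer: the feature vector x
        tf.keras.layers.Dense(3, activation="relu"),   # hidden layer: learns intermediate representations
        tf.keras.layers.Dense(1, activation="sigmoid") # output layer: one neuron for a yes/no prediction
    ])
    model.summary()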
DIFFERENT TYPES OF CONNECTIONS BETWEEN NEURONS
[Diagram: inputs X1 = 10, X2 = 30, X3 = 20 feed a summation node; the output is their sum divided by 3, i.e. (10 + 30 + 20)/3 = 20.]
EXAMPLE:
1. We want to predict whether the patient is diabetic or not.
2. We have the following variables from that patient's medical record: diagnoses, procedures, medications (input "X").
3. We want to use "X" to predict "y", i.e. use the medical record to predict whether the patient is diabetic or not. You also want many training examples.
1. Input vector x: Xi = (10, 30, 20)
2. Weight vector w: wi = (0, 0, 0)
5. Output: Y = 1
Error of prediction
X (patient): diagnoses, procedures, medications
[Diagram: inputs diagnoses = 10, procedures = 30, medications = 20, plus a bias b, feed the input layer Xi, then a hidden layer, an activation function, and an output layer producing 1 or 0.]
Actual label: not diabetic (Y real = 0); loss function: Y real − Y predicted = 0 − 1
How can we estimate the weights Wi and minimize the loss function?
Backpropagation is an iterative optimization algorithm that trains a neural network by adjusting the weights of the connections in the network in order to minimize the difference between the predicted output and the actual output for a given set of training examples.
ACTIVATION FUNCTION
Definition: An activation function is a mathematical function applied to the weighted
sum of the inputs of a neuron, optionally with a bias added, to determine the neuron's
output.
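A small illustrative sketch (not from the slides) of two common activation functions used later in this chapter:

    import numpy as np

    def relu(z):
        """ReLU: passes positive values through, clips negatives to 0."""
        return np.maximum(0.0, z)

    def sigmoid(z):
        """Sigmoid: squashes any real value into the range (0, 1)."""
        return 1.0 / (1.0 + np.exp(-z))

    print(relu(0.496))      # 0.496
    print(sigmoid(0.496))   # ~0.62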
In the context of neural networks, the choice of a loss function depends on the type of problem being solved (e.g., classification, regression) and the desired behavior of the model. Two loss functions used later in this chapter are the squared error 1/2 × (Y target − Y predicted)² and the binary cross-entropy −[y target × log(a) + (1 − y target) × log(1 − a)].
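A minimal sketch of these two losses in Python (single-example case; natural log by default, whereas the 0.30 figure quoted in the worked example below corresponds to log base 10):

    import numpy as np

    def squared_error(y_target, y_pred):
        """1/2 * (y_target - y_pred)**2, as used in the worked example below."""
        return 0.5 * (y_target - y_pred) ** 2

    def binary_cross_entropy(y_target, a):
        """-[y*log(a) + (1-y)*log(1-a)]; here log is the natural logarithm."""
        return -(y_target * np.log(a) + (1 - y_target) * np.log(1 - a))

    print(squared_error(1.0, 0.496))         # 0.127
    print(binary_cross_entropy(1.0, 0.496))  # ~0.70 with natural log (~0.30 with log base 10)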
BACKPROPAGATION
Definition: Backpropagation is an iterative optimization algorithm used to train neural networks by
computing the gradients of the loss function with respect to the weights of the network, and then using
these gradients to update the weights in a direction that reduces the loss.
Explanation:
Forward Pass: During the forward pass, input data is fed into the neural network, and the network computes
its output by propagating the input forward through the network layers using the current weights. The
predicted output is compared to the actual output, and the error (the difference between predicted and actual
output) is computed.
Backward Pass: During the backward pass, the error is propagated backward through the network to calculate
the gradients of the loss function with respect to each weight in the network. This is done using the chain rule
of calculus, which allows the error at the output layer to be decomposed into contributions from each weight in
the network.
Weight Update: Once the gradients of the loss function with respect to each weight are computed, the
weights are updated in the direction that reduces the loss. This is typically done using an optimization
algorithm such as gradient descent, which updates the weights by subtracting a fraction of the gradient
from each weight.
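To make this loop concrete, here is a minimal toy sketch of the forward pass, backward pass, and weight update for a single linear neuron with squared-error loss (an illustration, not the example network used later):

    # Toy data: y = 2x, one weight to learn
    x, y_target = 3.0, 6.0
    w, lr = 0.1, 0.05            # initial weight and learning rate

    for step in range(20):
        # Forward pass: compute prediction and loss
        y_pred = w * x
        loss = 0.5 * (y_target - y_pred) ** 2
        # Backward pass: chain rule gives dL/dw = (y_pred - y_target) * x
        grad_w = (y_pred - y_target) * x
        # Weight update: move against the gradient
        w -= lr * grad_w

    print(w)   # approaches 2.0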
EXAMPLE
1. ANN structure
Input layer: one neuron
Hidden layer: two neurons
Output layer: one neuron
Input: X = 0.6
2. Weights connecting input to hidden layer
W1,1 = 0.5
W1,2 = 0.3
3. Biases for hidden layer
b1 = 0.1
4. Weights connecting hidden to output layer
W3,1 = 0.1
W3,2 = -0.2
5. Biases for output layer
b2 = 0.4
6. Activation function: ReLU
Loss function = −[y target × log(a3) + (1 − y target) × log(1 − a3)]
[Diagram: X = 0.6 feeds hidden neurons Z1 and Z2 (W1,1 = 0.5, W1,2 = 0.3, b1 = 0.1), whose activations a1 and a2 feed the output neuron Z3 (W3,1 = 0.1, W3,2 = 0.2, b2 = 0.4).]
Forward pass (hidden layer)
Weighted sums:
Z1 = (0.6 × 0.5) + 0.1 = 0.4
Z2 = (0.6 × 0.3) + 0.1 = 0.28
Activation function (ReLU):
a1 = max(0, 0.4) = 0.4
a2 = max(0, 0.28) = 0.28
Forward pass (output layer)
Weighted sum:
Z3 = (0.4 × 0.1) + (0.28 × 0.2) + 0.4 = 0.496 (with W3,2 = −0.2 as listed above, Z3 = 0.384)
Activation function (ReLU):
a3 = max(0, 0.496) = 0.496 (with a sigmoid output instead, a3 ≈ 0.62, or ≈ 0.59 for Z3 = 0.384)
Loss function (binary cross-entropy):
loss = −[y target × log(a3) + (1 − y target) × log(1 − a3)]
loss = −[1 × log(0.496) + (1 − 1) × log(1 − 0.496)] = 0.30 (using log base 10)
Alternative loss: squared error
With the same forward pass (Z1 = 0.4, Z2 = 0.28, a1 = 0.4, a2 = 0.28, Z3 = 0.496, a3 = 0.496), the loss can instead be computed as a squared error:
Loss function = loss = 1/2 × (Y target − Y predicted)²
loss = 1/2 × (1 − 0.496)² = 0.127
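A numpy sketch reproducing this small forward pass and both loss values (weights and input as in the example above; log base 10 is assumed for the quoted cross-entropy figure):

    import numpy as np

    x = 0.6
    w_hidden = np.array([0.5, 0.3])      # W1,1 and W1,2
    b1 = 0.1
    w_out = np.array([0.1, 0.2])         # W3,1 and W3,2
    b2 = 0.4
    y_target = 1.0

    z_hidden = w_hidden * x + b1                 # [0.4, 0.28]
    a_hidden = np.maximum(0.0, z_hidden)         # ReLU -> [0.4, 0.28]
    z3 = np.dot(w_out, a_hidden) + b2            # 0.496
    a3 = max(0.0, z3)                            # ReLU -> 0.496

    mse_loss = 0.5 * (y_target - a3) ** 2                                        # 0.127
    bce_loss = -(y_target * np.log10(a3) + (1 - y_target) * np.log10(1 - a3))    # ~0.30
    print(z_hidden, a_hidden, z3, a3, mse_loss, bce_loss)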
EXAMPLE
Business context / problem
1. We want to predict whether the patient is diabetic or not.
2. We have the following variables from that patient's medical record: diagnoses, procedures, medications, lab (input "X").
3. We want to use "X" to predict "y", i.e. use the medical record to predict whether the patient is diabetic or not. You also want many training examples.
Data input
diabetic: Yes (1)
Diagnoses: 0.5
Procedures: 2.8
Medications: 0
Lab: −0.1
1. ANN structure
Input layer: four neurons
Hidden layer: three neurons
Output layer: one neuron
2. Weights connecting input to hidden layer
W1,1 W1,2 W1,3 = 3 / −1.5 / 2
W2,1 W2,2 W2,3 = 1 / −3 / 1
W3,1 W3,2 W3,3 = 5 / −7.1 / −6
W4,1 W4,2 W4,3 = −4 / 5.2 / 2.9
3. Biases for hidden layer
b1 = −2
4. Weights connecting hidden to output layer
W5,1 W5,2 W5,3 = −2 / 5 / 1
5. Biases for output layer
b2 = 0.6
Loss function = −[y target × log(a3) + (1 − y target) × log(1 − a3)]
Number of weights Wi
Total 1 = (neurons in input layer) × (neurons in hidden layer) = 4 × 3 = 12
Total 2 = (neurons in hidden layer) × (neurons in output layer) = 3 × 1 = 3
Total = 12 + 3 = 15
EXAMPLE:
2. First hidden layer
To calculate the net input for each neuron in the first hidden layer, we use the formula:
z_j = Σ_i (Wi,j × xi) + b1
Let's calculate the net inputs for the first hidden layer:
z1 = (3 × 0.5) + (1 × 2.8) + (5 × 1) + (−4 × −0.1) − 2 = 1.5 + 2.8 + 5 + 0.4 − 2 = 7.7
z2 = (−1.5 × 0.5) + (−3 × 2.8) + (−7.1 × 1) + (5.2 × −0.1) − 2 = −0.75 − 8.4 − 7.1 − 0.52 − 2 = −18.77
z3 = (2 × 0.5) + (1 × 2.8) + (−6 × 1) + (2.9 × −0.1) − 2 = 1 + 2.8 − 6 − 0.29 − 2 = −4.49
3. Activation function (ReLU): a1 = max(0, 7.7) = 7.7, a2 = max(0, −18.77) = 0, a3 = max(0, −4.49) = 0
4. Weights connecting the first hidden layer to the second hidden layer
Weighted sums:
z4 = (3.5 × 7.7) + (1 × 0) + (3 × 0) − 2 = 26.95 − 2 = 24.95
z5 = (0.5 × 7.7) + (−3.5 × 0) + (−2.1 × 0) − 2 = 3.85 − 2 = 1.85
z6 = (4 × 7.7) + (2 × 0) + (−1 × 0) − 2 = 30.8 − 2 = 28.8
5. Activation function (ReLU): a4 = 24.95, a5 = 1.85, a6 = 28.8
Weighted sum for the output layer:
z7 = (3.5 × 24.95) + (−4.5 × 1.85) + (1 × 28.8) + 0.6
Step 6: Activation - Sigmoid
Applying the sigmoid activation function:
a7 = σ(z7) ≈ 1
7. The binary cross-entropy loss function:
L = −log(a7) = 0
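A numpy sketch of this deeper forward pass, using the weights exactly as they appear in the arithmetic above (note these are assumptions where the slides are ambiguous: the third input is taken as 1 and the output weights as 3.5 / −4.5 / 1, to match the computed z values):

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, 2.8, 1.0, -0.1])       # diagnoses, procedures, medications, lab (as used in z1..z3)

    W1 = np.array([[3.0, -1.5, 2.0],          # input -> first hidden layer (4 x 3)
                   [1.0, -3.0, 1.0],
                   [5.0, -7.1, -6.0],
                   [-4.0, 5.2, 2.9]])
    b1 = -2.0

    W2 = np.array([[3.5, 0.5, 4.0],           # first hidden -> second hidden layer (3 x 3)
                   [1.0, -3.5, 2.0],
                   [3.0, -2.1, -1.0]])
    b2 = -2.0

    w_out = np.array([3.5, -4.5, 1.0])        # second hidden -> output (as used in z7)
    b_out = 0.6

    a_h1 = relu(x @ W1 + b1)                  # [7.7, 0, 0]
    a_h2 = relu(a_h1 @ W2 + b2)               # [24.95, 1.85, 28.8]
    a7 = sigmoid(a_h2 @ w_out + b_out)        # ~1.0
    loss = -np.log(a7)                        # ~0 when the target is 1
    print(a_h1, a_h2, a7, loss)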
New case: if Y = 0
Solution:
Since Y = 0, the loss simplifies to:
L = −log(1 − a7)
Gradient Descent
To update the weights and biases using gradient descent, we need to calculate the derivatives of the loss function with respect to each weight and bias, and apply the update rule:
w_new = w_old − η × gradient
Where:
η (eta) is the learning rate.
gradient is the gradient of the loss function with respect to the weight.
We need to calculate the gradients for each weight and bias in the network.
Solution:
Calculating gradients
For the output layer, the gradient of the loss function with respect to the output activation a7 is given by:
∂L/∂a7 = 1 / (1 − a7)   (since L = −log(1 − a7) for Y = 0)
For the sigmoid activation function, the derivative is:
σ′(z) = σ(z) × (1 − σ(z))
So, the gradient of the loss function with respect to the net input z7 is:
∂L/∂z7 = ∂L/∂a7 × σ′(z7) = a7   (for the general binary cross-entropy, ∂L/∂z7 = a7 − Y)
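A short sketch of this output-layer gradient and a single gradient-descent update on one weight (the numeric values for z7, the incoming activation, and the learning rate are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    y_target = 0.0                 # the "new case" with Y = 0
    z7 = 2.0                       # illustrative net input to the output neuron
    a7 = sigmoid(z7)               # ~0.881

    dL_da7 = 1.0 / (1.0 - a7)      # derivative of L = -log(1 - a7)
    dsigma = a7 * (1.0 - a7)       # sigmoid'(z7) = sigma(z7) * (1 - sigma(z7))
    dL_dz7 = dL_da7 * dsigma       # chain rule; equals a7 here (a7 - y in general)

    a_in = 24.95                   # activation feeding the weight being updated
    dL_dw = dL_dz7 * a_in          # gradient of the loss w.r.t. that weight
    eta = 0.01                     # learning rate
    w = 3.5
    w = w - eta * dL_dw            # gradient-descent update: w_new = w_old - eta * gradient
    print(dL_dz7, w)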
HOW NEURAL NETWORKS LEARN USING GRADIENT DESCENT
CONCEPT
DEFINITION
How can we know what a good learning rate for our model is?
Empirically: test multiple values between 0 and 1 and keep the one that gives the best accuracy (or the lowest RMSE), as in the sketch below.
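A rough sketch of such an empirical search; train_and_evaluate is a hypothetical helper (here a dummy) that would train the network with a given learning rate and return a validation score:

    import random

    def train_and_evaluate(lr):
        """Placeholder: in practice, train the model with learning rate lr and
        return the validation accuracy (or a negated RMSE)."""
        return random.random()  # dummy score so the sketch runs end to end

    candidate_rates = [0.001, 0.01, 0.05, 0.1, 0.5, 1.0]
    scores = {lr: train_and_evaluate(lr) for lr in candidate_rates}
    best_lr = max(scores, key=scores.get)   # keep the learning rate with the best score
    print(best_lr, scores[best_lr])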
CHALLENGES WITH GRADIENT DESCENT
Local minima and saddle points: Gradient descent can easily find the global minimum of convex problems, but it can struggle to do so for nonconvex problems, where the global minimum is the point at which the model would achieve the best results. Once the slope of the cost function approaches or reaches zero, the learning process stops. This can happen in multiple scenarios, such as local minima and saddle points. The former looks like a global minimum, with the slope increasing on either side of the current point. The latter is named after a horse's saddle, since it has a local maximum on one side and a local minimum on the other, so the negative gradient only exists on one side of the point. To escape such locations, it can be beneficial to introduce noise to the gradient.
MATH OF GRADIENT DESCENT
EPOCH, ITERATION, BATCH
One epoch is when the ENTIRE dataset is passed forward and backward through the neural network exactly ONCE.
Critical point:
We are using gradient descent, which is an iterative process, so updating the weights with a single pass, or one epoch, is not enough. In practice the dataset is split into batches, and each batch gives one iteration, i.e. one weight update (see the sketch below).
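A minimal sketch of how epochs, batches, and iterations relate in a training loop (illustrative shapes and a placeholder update step, not the slides' own code):

    import numpy as np

    X = np.random.rand(1000, 4)          # 1,000 training examples with 4 features
    y = np.random.randint(0, 2, 1000)    # binary labels
    batch_size = 100
    n_epochs = 5

    for epoch in range(n_epochs):                       # one epoch = one full pass over the dataset
        n_iterations = 0
        for start in range(0, len(X), batch_size):      # one batch per iteration
            X_batch = X[start:start + batch_size]
            y_batch = y[start:start + batch_size]
            # forward pass, backward pass, and weight update on this batch would go here,
            # e.g. w -= lr * gradient (placeholder)
            n_iterations += 1
        print(epoch, n_iterations)                       # 1000 / 100 = 10 iterations per epoch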