lecture 9-NN- modified
Biological Inspirations
• Humans perform complex tasks like vision, motor
control, or language understanding very well.
[Figure: model of an artificial neuron: inputs X2, X3 with weights Wb, Wc are summed into S, passed through an activation f(S) to give the output Y]
• Reinforcement learning
3) Activation Functions
• An activation function is an extra transformation applied over the weighted input of a neuron to obtain the desired output. In an ANN, we apply an activation function to each neuron's summed input to produce its output.
Why do we need Activation Functions?
• A neural network without activation functions would simply be a linear regression model.
• A linear equation is a polynomial of degree one.
• We want a neural network to not just learn and compute a linear function but
something more complicated than that.
• This includes complicated kinds of data such as images, videos, audio, and speech.
Activation Function types
• Binary step
• Sigmoid / Logistic
• Tanh
• ReLU
• Softplus
• Signum
• Softmax
Activation Function types
1- Binary Step
• Binary step function depends on a threshold value that
decides whether a neuron should be activated or not.
limitations
•It cannot provide multi-value outputs—for example, it cannot be used for multi-class
classification problems.
•The gradient of the step function is zero, which causes a hindrance in the backpropagation
process.
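As a quick illustration of the step rule above, here is a minimal Python/NumPy sketch; the threshold value of 0 and the function name are assumptions for the example, not part of the lecture.

```python
import numpy as np

def binary_step(x, threshold=0.0):
    """Binary step activation: 1 if the input reaches the threshold, else 0."""
    return np.where(x >= threshold, 1.0, 0.0)

print(binary_step(np.array([-2.0, 0.0, 3.5])))  # -> [0. 1. 1.]
```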
Activation Function types
2- Sigmoid / Logistic Activation Function
• This function takes any real value as input
and outputs values in the range of 0 to 1.
• It is commonly used for models where we
have to predict the probability as an
output.
• The function is differentiable at every point and provides a
smooth gradient, preventing jumps in output values.
Its graph has an S-shape.
Activation Function types
2- Sigmoid / Logistic Activation Function
limitations
•As the gradient value approaches zero, the network
stops learning and suffers from the vanishing gradient
problem.
•The outputs aren’t zero centred. The output of this activation function always lies
between 0 and 1, i.e. it is always positive. As a result, training takes substantially longer to
converge, whereas a zero-centred function helps fast convergence.
•It saturates and kills gradients. Refer to the figure of the derivative of the sigmoid. At
both positive and negative ends, the value of the gradient saturates at 0. That means for
those values, the gradient will be 0 or close to 0, which simply means no learning in
backpropagation.
• It is computationally expensive because of the exponential term in it.
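The saturation described above can be seen numerically. Below is a minimal sketch (my own function names) of the sigmoid and its derivative σ(x)·(1 − σ(x)); for large positive or negative inputs the derivative is close to 0.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    """d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); peaks at 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, 0.0, 10.0):
    # At both ends the gradient is ~0 (saturation), so backpropagation barely updates the weights.
    print(x, sigmoid(x), sigmoid_derivative(x))
```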
Activation Function types
3- Tanh (Hyperbolic Tangent)
• The tanh function is very similar to the sigmoid/logistic activation
function, and even has the same S-shape, with the difference that its
output range is -1 to 1; it is a mathematically shifted and scaled
version of the sigmoid.
• It has similar advantages to the sigmoid but is better because it is
zero centred: the output of tanh lies between -1 and 1,
solving one of the issues with the sigmoid.
limitations
•It also has the problem of vanishing gradient but the
derivatives are steeper than that of the sigmoid. Hence making
the gradients stronger for tanh than sigmoid.
•As it is very similar to the sigmoid, tanh is also computationally
expensive because of the exponential term in it.
•Similar to sigmoid, here also the gradients saturate.
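A small sketch (assumed variable names) showing the "shifted and scaled sigmoid" relationship, tanh(x) = 2·σ(2x) − 1, and the zero-centred output range.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)
print(np.tanh(x))                                        # zero-centred outputs in (-1, 1)
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))   # True: tanh is a rescaled sigmoid
```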
Activation Function types
4- ReLU Function (Rectified Linear Unit)
•It is computationally efficient as it involves simpler
mathematical operations than sigmoid and tanh.
•Although it looks like a linear function, it adds non-linearity to
the network, making it able to learn complex patterns.
•It doesn't suffer from the vanishing gradient problem.
•It is unbounded at the positive side. Hence removing the
problem of gradient saturation.
limitations
•It suffers from the dying ReLU problem. ReLU discards all negative
values (the deactivations) by setting them to 0, and as a result the
gradient of those units also becomes 0, so they stop learning.
• It is non-differentiable at 0.
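A minimal sketch of ReLU and its gradient illustrating the dying-ReLU point above: negative inputs give both output 0 and gradient 0. Taking the gradient at exactly 0 to be 0 is a convention assumed here.

```python
import numpy as np

def relu(x):
    """ReLU: max(0, x)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: 1 for x > 0, 0 otherwise (convention used at x = 0)."""
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x), relu_grad(x))  # negative inputs: output 0 and gradient 0 -> no learning
```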
Activation Function types
5- Softmax
• The Softmax function can be described as a combination of multiple
sigmoids.
• It calculates the relative probabilities. Similar to the sigmoid/logistic activation function,
the SoftMax function returns the probability of each class.
• It is most commonly used as an activation function for the last layer of the neural
network in the case of multi-class classification.
• It is able to handle multiple classes. It maps the output for each class to a value
between 0 and 1 and divides by their sum, forming a probability distribution and
therefore giving a clear probability of the input belonging to each particular class.
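A minimal softmax sketch in NumPy; subtracting the maximum before exponentiating is a standard numerical-stability trick assumed here and does not change the result.

```python
import numpy as np

def softmax(z):
    """Exponentiate and normalize so the outputs form a probability distribution."""
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())  # e.g. [0.659 0.242 0.099], sums to 1.0
```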
How to choose the Activation Function
• You need to match your activation function for your output layer based on the
type of prediction problem that you are solving—specifically, the type of
predicted variable.
1. You can begin with the ReLU activation function and then move to other activation
functions if ReLU doesn’t provide optimum results.
2. Sigmoid/Logistic and Tanh functions should not be used in hidden layers, as they make the
model more susceptible (sensitive) to problems during training (due to vanishing gradients).
How to choose the Activation Function
a few rules for choosing the activation function for your output layer
based on the type of prediction problem that you are solving:
• Regression → Linear activation
• Binary classification → Sigmoid/Logistic activation
• Multi-class classification → Softmax activation
AND
[Figure: inputs X1 and X2, each connected to the output neuron Y with weight 1]
AND Function
X1  X2  Y
1   1   1
1   0   0
0   1   0
0   0   0
Threshold(Y) = 2
The First Neural Networks: OR
[Figure: inputs X1 and X2, each connected to the output neuron Y with weight 2]
OR Function
X1  X2  Y
1   1   1
1   0   1
0   1   1
0   0   0
Threshold(Y) = 2
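A minimal sketch of the two threshold neurons above: with weights (1, 1) for AND and (2, 2) for OR, the output fires only when the weighted sum reaches the threshold of 2. The helper name threshold_neuron is my own.

```python
def threshold_neuron(inputs, weights, threshold=2):
    """Fire (return 1) if the weighted sum of the inputs reaches the threshold."""
    s = sum(x * w for x, w in zip(inputs, weights))
    return 1 if s >= threshold else 0

for x1 in (1, 0):
    for x2 in (1, 0):
        and_y = threshold_neuron((x1, x2), (1, 1))  # AND: weights 1, 1
        or_y = threshold_neuron((x1, x2), (2, 2))   # OR: weights 2, 2
        print(x1, x2, and_y, or_y)
```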
Architecture of a typical artificial neural network
XOR
[Figure: 2-2-1 network for XOR: inputs x1, x2 feed hidden neurons A and B, which feed the output neuron Yi (1/0)]
Weights = Wi, where W1 = W¹₁₁, W2 = W¹₂₁, W3 = W¹₁₂, W4 = W¹₂₂, W5 = W²₁₁, W6 = W²₁₂ (the superscript denotes the layer)
Activation function input
[Figure: Input layer (x1, x2), Hidden layer (neurons A and B), Output layer (neuron Yi giving 1/0); x1 and x2 feed A and B through W1, W2, W3, W4, and A and B feed Yi through W5 and W6]
S = sop(Xi, Wi) = Σ xᵢ·wᵢ, i = 1 … m
Hidden neuron A: Sop = x1·w1 + x2·w3
Hidden neuron B: Sop = x1·w2 + x2·w4
Activation function output
Output neuron: Sop = s1·w5 + s2·w6
Each SOP s is passed through an activation function F(s); with the binary step function the output Yi is 1/0.
Bias (hidden & output layer neurons)
[Figure: the same network with a bias input b added to each hidden and output neuron]
S1 = b1 + x1·w1 + x2·w3
S2 = b2 + x1·w2 + x2·w4
S3 = b3 + s1·w5 + s2·w6
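A minimal forward-pass sketch for this 2-2-1 network with biases, computing S1, S2 and S3 exactly as written above. The sigmoid activation and the numeric values are assumptions for illustration (the numbers are taken from the worked example at the end of this lecture).

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Illustrative inputs, weights and biases
x1, x2 = 0.1, 0.3
w1, w2, w3, w4, w5, w6 = 0.5, 0.62, 0.1, 0.2, -0.2, 0.3
b1, b2, b3 = 0.4, -0.1, 1.83

s1 = b1 + x1 * w1 + x2 * w3          # SOP of hidden neuron A
s2 = b2 + x1 * w2 + x2 * w4          # SOP of hidden neuron B
out1, out2 = sigmoid(s1), sigmoid(s2)  # hidden activations (assumed to feed the output SOP)

s3 = b3 + out1 * w5 + out2 * w6      # SOP of the output neuron
y = sigmoid(s3)                      # network prediction
print(s1, s2, s3, y)                 # 0.48, 0.022, ~1.858, ~0.865
```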
Learning rate: 0 ≤ α ≤ 1
Step: n = 0, 1, 2, …, N
Desired output: dj
[Figure: the same network, annotated with its biases and the desired output dj]
Element of a Neural Network
[Figure: a fully connected network; Input layer (x1, x2, …, xN), hidden Layers 1, 2, …, L, Output layer (y1, y2, …, yM); each node is a neuron]
Deep means many hidden layers.
Neural Network training steps
1 Weight Initialization
2 Inputs Application
3 Sum of inputs-weights products (SOP)
4 Activation functions
5 Weights Adaptations
6 Back to step 2
Regarding 5th step: Weights Adaptation
First method:
Learning rate: 0 ≤ α ≤ 1
Regarding 5th step: Weights Adaptation
Second method: Backpropagation
▪ Forward VS Backward passes
The Backpropagation algorithm is a sensible approach for dividing the contribution of each weight to the prediction error.
Forward pass: Inputs → Weights → SOP → Prediction → Error
Backward pass: Error → Prediction → SOP → Weights → Inputs
Regarding 5th step: Weights Adaptation
Second method: Backpropagation
▪ Backward pass
What is the change in prediction Error (E) given the change in weight (W)?
Get the partial derivative of E with respect to W:  ∂E/∂W
E = ½ (d − y)²
y = f(s) = 1 / (1 + e^(−s))
s = Σ xᵢ·wᵢ + b   (Sum Of Products, SOP), with weights w1, w2
d (desired output) is a constant; y is the predicted output.
Substituting:  E = ½ (d − 1 / (1 + e^(−(Σ xᵢ·wᵢ + b))))²
Regarding 5th step: Weights Adaptation
Second method: Backpropagation (Chain Rule)
▪ Weight derivative
E = ½ (d − y)²,   y = f(s) = 1 / (1 + e^(−s)),   s = x1·w1 + x2·w2 + b,   weights w1, w2
∂E/∂W = ∂E/∂y · ∂y/∂s · ∂s/∂W,   evaluated for ∂s/∂w1 and ∂s/∂w2
∂E/∂w1 = ∂E/∂y × ∂y/∂s × ∂s/∂w1
∂E/∂w2 = ∂E/∂y × ∂y/∂s × ∂s/∂w2
Regarding 5th step: Weights Adaptation
Second method: Backpropagation
▪ Weight derivative
∂E/∂y = ∂/∂y [½ (d − y)²] = y − d
∂y/∂s = ∂/∂s [1 / (1 + e^(−s))] = 1/(1 + e^(−s)) · (1 − 1/(1 + e^(−s)))
∂s/∂w1 = ∂/∂w1 [x1·w1 + x2·w2 + b] = x1
∂s/∂w2 = ∂/∂w2 [x1·w1 + x2·w2 + b] = x2
∂E/∂wi = (y − d) · 1/(1 + e^(−s)) · (1 − 1/(1 + e^(−s))) · xi
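A minimal sketch (assumed variable names and numeric values) of the derivative chain just derived, ∂E/∂wi = (y − d)·σ(s)·(1 − σ(s))·xi, for a single sigmoid neuron.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Illustrative values (assumptions, not from the lecture)
x = np.array([0.1, 0.3])          # inputs x1, x2
w = np.array([0.5, 0.2])          # weights w1, w2
b, d = 0.4, 1.0                   # bias and desired output

s = np.dot(x, w) + b              # SOP: s = x1*w1 + x2*w2 + b
y = sigmoid(s)                    # predicted output

dE_dy = y - d                     # dE/dy for E = 0.5*(d - y)^2
dy_ds = y * (1.0 - y)             # sigmoid derivative
ds_dw = x                         # ds/dw_i = x_i
grad = dE_dy * dy_ds * ds_dw      # chain rule: dE/dw_i for each weight
print(grad)
```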
Regarding 5th step: Weights Adaptation
Second method: Backpropagation
▪ Interpreting the derivatives ∇W
∂E/∂wi = (y − d) · ∂f(s)/∂s · xi
Derivative sign:
• Positive → E and wi are directly proportional
• Negative → E and wi change in opposite directions
Magnitude: how strongly E changes with wi.
Regarding 5th step: Weights Adaptation
Second method: Backpropagation
▪ Update the weights
In order to update the weights, use Gradient Descent: Wnew = Wold − α · (∂E/∂W)
[Figure: f(w) vs w; with a negative slope, Wnew = Wold − α·(−ve) increases w, while with a positive slope, Wnew = Wold − α·(+ve) decreases w]
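A one-function sketch of the update rule above, using the learning rate α introduced earlier; the names and example numbers are assumptions.

```python
import numpy as np

def update_weights(w_old, grad, alpha=0.1):
    """Gradient descent step: move against the gradient.
    A negative gradient increases w; a positive gradient decreases w."""
    return w_old - alpha * grad

w = np.array([0.5, 0.2])
grad = np.array([-0.02, 0.03])          # example dE/dw values (assumed)
print(update_weights(w, grad))          # -> [0.502 0.197]
```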
Simple Neural Network Training Example
(Backpropagation)
Neural Networks training example
Neural Networks training: Steps (Forward)
1- Activation function input (SOP)
Neural Networks training: Steps (Backward)
∂E/∂W = ∂E/∂y · ∂y/∂s · ∂s/∂w,   for w1 and w2
1- Partial derivative of the error w.r.t. the predicted output:
∂E/∂y = ∂/∂y [½ (d − y)²] = y − d
2- Partial derivative of the predicted output w.r.t. the SOP:
∂y/∂s = ∂/∂s [1 / (1 + e^(−s))] = 1/(1 + e^(−s)) · (1 − 1/(1 + e^(−s)))
3- Partial derivative of the SOP w.r.t. W1 and W2:
∂s/∂w1 = ∂/∂w1 [x1·w1 + x2·w2 + b] = x1
∂s/∂w2 = ∂/∂w2 [x1·w1 + x2·w2 + b] = x2
Neural Network Training Example
(Backpropagation)
Example
Hidden layer input (weights times the input vector [x1, x2, 1], where the last entry is the bias input):
in_h1 = [ 0.5   0.1   0.4 ] [0.1]   =  [0.48 ]
        [ 0.62  0.2  −0.1 ] [0.3]      [0.022]
                            [1  ]
Hidden layer output:
σ(in_h1) = out_h1 = [ 1/(1 + e^(−0.48))  ]  =  [0.618]
                    [ 1/(1 + e^(−0.022)) ]     [0.506]
Output layer input:
in_out = [−0.2  0.3  1.83] [0.618]   =  1.858
                           [0.506]
                           [1    ]
Output layer output:
σ(in_out) = out_out = 1/(1 + e^(−1.858)) = 0.865
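A minimal NumPy sketch that reproduces this forward pass; treating the last column of each weight matrix as the bias acting on a constant input of 1 is my reading of the slide.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

W_hidden = np.array([[0.5, 0.1, 0.4],      # weights to hidden neuron 1 (last entry = bias)
                     [0.62, 0.2, -0.1]])   # weights to hidden neuron 2
W_out = np.array([-0.2, 0.3, 1.83])        # weights to the output neuron (last entry = bias)

x = np.array([0.1, 0.3, 1.0])              # inputs x1, x2 and constant bias input 1

in_h = W_hidden @ x                        # [0.48, 0.022]
out_h = sigmoid(in_h)                      # [0.618, 0.506]

in_out = W_out @ np.append(out_h, 1.0)     # 1.858
out_out = sigmoid(in_out)                  # 0.865
print(in_h, out_h, in_out, out_out)
```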