
Lecture_2 (1)

Lecture 2 covers Multilayer Perceptrons (MLPs) in deep learning, explaining the structure and functioning of neural networks inspired by biological systems. It details the perceptron model, learning algorithms such as backpropagation, and design considerations for MLPs, including activation functions and issues like overfitting and the vanishing gradient problem. The lecture emphasizes the importance of network architecture and training processes in achieving effective machine learning outcomes.

Uploaded by

Abdelrhman Adel
Copyright
© © All Rights Reserved

Lecture 2: Multilayer Perceptrons

CS460: Deep Learning


What is a neural network?
• A neural network is a machine designed to model the way in which the brain performs a particular task or function of interest.
• To achieve good performance, neural networks employ a massive interconnection of simple computing cells referred to as "neurons" or "processing units."
Inspired by humans
• The brain is a highly complex, nonlinear, and parallel computer. It has the capability to organize its neurons so as to perform certain computations (e.g., pattern recognition, perception, and motor control) many times faster than the fastest digital computer in existence today.
Biological Neural Networks
[Figure: biological neurons with labeled parts: dendrites, soma, axon, and synapse]
Modeling the single neuron
Learning in simple neurons
• If we have two groups of objects, one group of several written A's and the other of B's, we may want our neuron to tell the A's from the B's, as in the figure.
• We want it to output a 1 when an A is presented and a 0 when it sees a B.
Biology analogy
Biological    Artificial
Soma          Node/neuron
Dendrites     Input
Axon          Output
Synapse       Weight
The perceptron
• The simplest kind of neural network is a single-layer perceptron network, which consists of a single layer of output nodes; the inputs are fed directly to the outputs via a series of weights. In each node, the sum of the products of the weights and the inputs is calculated. If the value is above some threshold, the neuron fires and takes the activated value; otherwise it takes the deactivated value.
• Neurons with this kind of activation function are also called artificial neurons or linear threshold units.
• In the literature, the term perceptron often refers to networks consisting of just one of these units.
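The perceptron (cont’d)
• The perceptron is an algorithm for learning a binary classifier called a threshold function: a function that maps its input $\mathbf{x}$ (a real-valued vector) to an output value $f(\mathbf{x})$ (a single binary value):

$$f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b > 0 \\ 0 & \text{otherwise} \end{cases}$$

• Here $\mathbf{w}$ is a vector of real-valued weights, $\mathbf{w} \cdot \mathbf{x} = \sum_{i=1}^{m} w_i x_i$ is the dot product, $m$ is the number of inputs to the perceptron, and $b$ is the bias.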
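As a minimal sketch of the threshold function described above, the following evaluates a single perceptron in plain Python. The function name `perceptron_output` and the weights and bias (picked by hand so the unit behaves like logical AND) are assumptions for illustration, not values from the lecture.

```python
def perceptron_output(w, x, b):
    """Return 1 if w . x + b > 0, else 0 (the threshold function f(x))."""
    # Dot product over the m inputs of the perceptron, plus the bias.
    activation = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1 if activation > 0 else 0


# Hypothetical weights and bias chosen so the unit acts like logical AND.
w_and = [1.0, 1.0]
b_and = -1.5
and_outputs = [perceptron_output(w_and, x, b_and)
               for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(and_outputs)  # [0, 0, 0, 1]
```

Only the input (1, 1) pushes the weighted sum above zero, so only that input makes the unit fire.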
The perceptron (cont’d)
• Perceptrons can be trained by a simple learning algorithm that is usually called the delta rule. It calculates the errors between the calculated output and the sample output data, and uses them to adjust the weights, thus implementing a form of gradient descent.
• Single-layer perceptrons are only capable of learning linearly separable patterns.
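The error-driven weight adjustment described above can be sketched as follows, applied here to a single threshold unit. The learning rate, epoch count, zero initial weights, and helper names (`train_perceptron`, `predict`) are assumptions made for this example. It trains on logical OR, which is linearly separable, and then on XOR, which is not.

```python
def step(z):
    return 1 if z > 0 else 0


def predict(w, b, x):
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_perceptron(samples, rate=0.1, epochs=20):
    """Train one threshold unit; samples is a list of (inputs, target) pairs."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(w, b, x)  # sample output minus calculated output
            # Adjust each weight in proportion to the error and its input.
            w = [wi + rate * error * xi for wi, xi in zip(w, x)]
            b += rate * error
    return w, b


or_samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
xor_samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w_or, b_or = train_perceptron(or_samples)
or_predictions = [predict(w_or, b_or, x) for x, _ in or_samples]

w_xor, b_xor = train_perceptron(xor_samples)
xor_predictions = [predict(w_xor, b_xor, x) for x, _ in xor_samples]

print(or_predictions)   # [0, 1, 1, 1]
print(xor_predictions)  # never equal to [0, 1, 1, 0]
```

However long the XOR run is allowed to continue, a single threshold unit draws one straight boundary and therefore misclassifies at least one of the four XOR inputs.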
Linearly Separable
XOR Function
• It is impossible for a single-layer perceptron network to learn the XOR function, because XOR is not linearly separable.
Non-linear transformations
• A single-layer neural network can compute a continuous output instead of a step function. A common choice is the so-called logistic function:

$$f(x) = \frac{1}{1 + e^{-x}}$$

Non-linear transformations
• The logistic function is one of the family of functions called sigmoid functions. It has a continuous derivative, which allows it to be used in backpropagation. This function is also preferred because its derivative is easily calculated:

$$f'(x) = f(x)\,(1 - f(x))$$
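A short sketch of why the derivative is cheap to compute: for the logistic function f(x) = 1/(1 + e^(-x)), the derivative can be written as f(x)(1 - f(x)), so it reuses the value already computed in the forward direction. The check below compares that closed form with a finite-difference estimate; the test point and step size are arbitrary choices.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def logistic_derivative(x):
    fx = logistic(x)
    return fx * (1.0 - fx)  # reuses f(x); no extra exponential needed


# Compare the closed form against a central finite difference.
x0, h = 0.5, 1e-6
numeric = (logistic(x0 + h) - logistic(x0 - h)) / (2 * h)
print(abs(numeric - logistic_derivative(x0)) < 1e-8)  # True
```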
Sigmoid function
Multilayer Perceptron (MLP)
• An MLP is a class of feedforward (acyclic) artificial neural network (ANN).
• Each neuron in one layer has directed connections to the neurons of the subsequent layer. In many applications the units of these networks apply a sigmoid function as an activation function.
• MLP models are the most basic deep neural networks, composed of a series of fully connected layers.
• Each new layer is a set of nonlinear functions of a weighted sum of all outputs (fully connected) from the prior one.
• Multilayer feedforward networks, given enough hidden units and enough training samples, can closely approximate any continuous function.
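The layered structure described above can be sketched in a few lines: each layer takes the previous layer's outputs, forms a weighted sum per unit, and applies a sigmoid. The 2-3-1 topology and every weight and bias below are arbitrary illustrative values, and the function names are assumptions.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """weights[j][i] connects input i to unit j of this layer."""
    return [
        logistic(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
        for unit_weights, bias in zip(weights, biases)
    ]


def mlp_forward(inputs, layers):
    """Feed the inputs through each (weights, biases) layer in order."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical 2-3-1 network with arbitrary illustrative parameters.
hidden = ([[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]], [0.0, 0.1, -0.1])
output = ([[0.7, -0.2, 0.4]], [0.05])
prediction = mlp_forward([1.0, 0.0], [hidden, output])
print(prediction)
```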
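The Architecture
• MLP with one hidden layer
[Figure: inputs x1, x2, and x3 enter the input layer and pass through a hidden layer and an output layer of processing elements (PE); each PE applies a weighted sum (S) followed by a transfer function (f) to produce the output Y1]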
MLP processing
PE: Processing Element (or neuron)

(a) Single neuron: inputs x1 and x2, weighted by w1 and w2, feed one PE with output Y.

$$Y = X_1 W_1 + X_2 W_2$$

(b) Multiple neurons: inputs x1 and x2 feed three PEs with outputs Y1, Y2, and Y3.

$$Y_1 = X_1 W_{11} + X_2 W_{21}$$
$$Y_2 = X_1 W_{12} + X_2 W_{22}$$
$$Y_3 = X_2 W_{23}$$
MLP processing (cont’d)
Inputs: X1 = 3, X2 = 1, X3 = 2
Weights: W1 = 0.2, W2 = 0.4, W3 = 0.1
Summation function: Y = 3(0.2) + 1(0.4) + 2(0.1) = 1.2
Transfer function: YT = 1/(1 + e^(-1.2)) = 0.77
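The arithmetic on this slide can be reproduced directly. The sketch below uses the slide's inputs and the weights that appear in its summation; the variable names are chosen for this example.

```python
import math

inputs = [3, 1, 2]          # X1, X2, X3 from the slide
weights = [0.2, 0.4, 0.1]   # W1, W2, W3 as used in the slide's summation

y = sum(x * w for x, w in zip(inputs, weights))   # summation function
y_t = 1.0 / (1.0 + math.exp(-y))                  # sigmoid transfer function
print(round(y, 2), round(y_t, 2))  # 1.2 0.77
```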
Designing the MLP
• Before training can begin, the user must decide on the network topology by specifying:
  ◦ the number of units in the input layer,
  ◦ the number of hidden layers (if more than one) and the number of units in each hidden layer, and
  ◦ the number of units in the output layer.
• Normalizing the input values (to between 0.0 and 1.0) for each attribute measured in the training tuples will help speed up the learning phase and reduce the risk of exploding gradients.
• Discrete-valued attributes may be encoded such that there is one input unit per domain value.
• The transfer function must also be chosen.
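The two input-preparation steps mentioned above can be sketched as follows: min-max normalization for numeric attributes and one input unit per domain value for discrete attributes. The function names and the sample data are hypothetical.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range 0.0 to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant attribute: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(value, domain):
    """One input unit per domain value: 1 for the matching value, else 0."""
    return [1 if value == d else 0 for d in domain]


ages = [20, 35, 50]                  # hypothetical numeric attribute
colors = ["red", "green", "blue"]    # hypothetical discrete domain
print(min_max_normalize(ages))       # [0.0, 0.5, 1.0]
print(one_hot("green", colors))      # [0, 1, 0]
```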
Transformation (Transfer) Function
• Linear function
• Sigmoid (logistic activation) function, with outputs between 0 and 1
• Hyperbolic tangent function, with outputs between -1 and 1
MLP: Design issues
• Neural networks can be used for both classification (to predict the class label of a given tuple) and numeric prediction (to predict a continuous-valued output).
• For classification, one output unit may be used to represent two classes (where the value 1 represents one class, and the value 0 represents the other).
• If there are more than two classes, then one output unit per class is used.
MLP: Design issues
• There are no clear rules as to the “best” number of hidden layer units.
• Network design is a trial-and-error process, and the design chosen may affect the accuracy of the resulting trained network.
• The initial values of the weights may also affect the resulting accuracy.
• Once a network has been trained, if its accuracy is not considered acceptable, it is common to repeat the training process with
  ◦ a different network topology, or
  ◦ a different set of initial weights.
The XOR function - revisited
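The figure for this slide is not reproduced here. As a sketch only, one well-known way for a network with one hidden layer to compute XOR uses two hidden threshold units, one acting as OR and one as AND. The weights and thresholds below are hand-picked assumptions, not values taken from the lecture.

```python
def step(z):
    return 1 if z > 0 else 0


def xor_mlp(x1, x2):
    h_or = step(x1 + x2 - 0.5)        # hidden unit 1 fires if either input is 1
    h_and = step(x1 + x2 - 1.5)       # hidden unit 2 fires only if both are 1
    return step(h_or - h_and - 0.5)   # output fires for "OR but not AND"


xor_outputs = [xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_outputs)  # [0, 1, 1, 0]
```

The hidden layer maps the four inputs into a space where the two classes can be separated by a single threshold unit.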
MLP Box Office prediction example
The Learning algorithm
• It adjusts the weights of the machine in order to minimize the average squared error.
Learning in MLP
• The learning algorithm procedure:
  1. Initialize the weights with random values and set other network parameters.
  2. Read in the inputs and the desired outputs.
  3. Compute the actual output (by working forward through the layers).
  4. Compute the error (the difference between the actual and desired output).
  5. Change the weights by working backward through the hidden layers.
  6. Repeat steps 2-5 until the weights stabilize.
Learning in MLP (cont’d)
• Backpropagation learns by iteratively processing a data set of training tuples, comparing the network’s prediction for each tuple with the actual known target value.
• The target value may be the known class label of the training tuple (for classification problems) or a continuous value (for numeric prediction).
• For each training tuple, the weights are modified so as to minimize the mean-squared error between the network’s prediction and the actual target value.
Learning in MLP (cont’d)
• These modifications are made in the “backwards” direction, i.e., from the output layer through each hidden layer down to the first hidden layer (hence the name backpropagation).
• Although it is not guaranteed, in general the weights will eventually converge, and the learning process stops.
MLP Bottlenecks
1. Dimensionality issue
• Rule of thumb: the number of training samples should be at least 5 to 10 times the number of weights in the network.
• Otherwise, the network is prone to overfitting.
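To apply the rule of thumb above, one first needs the number of weights. A minimal sketch, assuming a fully connected network and assuming that biases are counted along with the connection weights (the lecture does not say either way); the topology is hypothetical.

```python
def count_weights(layer_sizes, include_biases=True):
    """Number of trainable parameters in a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out      # one weight per connection between layers
        if include_biases:
            total += n_out         # one bias per unit (an assumption here)
    return total


layer_sizes = [10, 5, 1]   # hypothetical: 10 inputs, 5 hidden units, 1 output
n_weights = count_weights(layer_sizes)
print(n_weights, 5 * n_weights, 10 * n_weights)  # 61 305 610
```

For this small topology the rule suggests roughly 305 to 610 training samples.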
2. Overfitting
2. Overfitting (cont’d)
3. The black-box syndrome
• A common criticism of ANNs: the lack of transparency/explainability
• Answer: sensitivity analysis
  ◦ Conducted on a trained ANN
  ◦ The inputs are perturbed while the relative change in the output is measured/recorded
  ◦ Results illustrate the relative importance of the input variables
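The perturbation procedure described above can be sketched as follows. The function `trained_model` is a stand-in for a trained ANN (a single logistic unit with made-up weights), and the perturbation size and the input point are arbitrary assumptions.

```python
import math


def trained_model(x):
    """Stand-in for a trained ANN: a fixed logistic unit (hypothetical weights)."""
    weights, bias = [2.0, 0.1, -1.0], 0.0
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def sensitivity(model, x, delta=0.01):
    """Perturb each input in turn and record the relative change in output."""
    base = model(x)
    changes = []
    for i in range(len(x)):
        perturbed = list(x)
        perturbed[i] += delta
        changes.append(abs(model(perturbed) - base) / abs(base))
    return changes


scores = sensitivity(trained_model, [0.5, 0.5, 0.5])
# Rank the input variables from most to least influential.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 2, 1]
```

The ranking follows the magnitude of the stand-in weights, which is the kind of relative-importance picture the analysis is meant to give.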
sensitivity analysis
4. Vanishing gradient problem
• In machine learning, the vanishing gradient problem is encountered when training artificial neural networks with gradient-based learning methods and backpropagation. In such methods, during each iteration of training, each of the neural network's weights receives an update proportional to the partial derivative of the error function with respect to the current weight. The problem is that in some cases the gradient will be vanishingly small, effectively preventing the weight from changing its value. In the worst case, this may completely stop the neural network from further training.
• As one example of the cause, traditional activation functions such as the hyperbolic tangent function have gradients in the range (0, 1], and backpropagation computes gradients by the chain rule. This has the effect of multiplying n of these small numbers to compute the gradients of the early layers in an n-layer network, meaning that the gradient (error signal) decreases exponentially with n and the early layers train very slowly.
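A rough numeric illustration of the exponential decay described above. This sketch multiplies n copies of the hyperbolic tangent derivative and deliberately ignores the weight terms that the chain rule also contributes; it also assumes the same pre-activation value `z` at every layer, which is a simplification.

```python
import math


def tanh_derivative(z):
    return 1.0 - math.tanh(z) ** 2   # lies in (0, 1]


def chained_gradient(n_layers, z=1.0):
    """Product of n activation derivatives, ignoring the weight terms."""
    grad = 1.0
    for _ in range(n_layers):
        grad *= tanh_derivative(z)
    return grad


for n in (1, 5, 10, 20):
    print(n, chained_gradient(n))
```

With z = 1.0 each factor is about 0.42, so the product shrinks by more than half with every additional layer.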
Building Neural Networks
• The architecture of a neural network is driven by the task it is intended to address
  ◦ Classification, regression, clustering, general optimization, association, ...
• Most popular architecture: feedforward multilayer perceptron with the backpropagation learning algorithm
  ◦ Used for both classification and regression type problems
• Others: recurrent networks, self-organizing feature maps, Hopfield networks, ...
Development of NNs
Backpropagation
• Multilayer networks use a variety of learning techniques, the most popular being backpropagation.
• The output values are compared with the correct answer to compute the value of some predefined error function. By various techniques, the error is then fed back through the network.
• The algorithm adjusts the weights of each connection in order to reduce the value of the error function by some small amount.
• After repeating this process for a sufficiently large number of training cycles, the network will usually converge to some state where the error of the calculations is small.
• In this case, one would say that the network has learned a certain target function. To adjust weights properly, one applies a general method for nonlinear optimization that is called gradient descent. For this, the network calculates the derivative of the error function with respect to the network weights, and changes the weights such that the error decreases (thus going downhill on the surface of the error function).
• For this reason, backpropagation can only be applied to networks with differentiable activation functions.
The steps of backpropagation
Initialize the weights:
• The weights in the network are initialized to small random numbers (e.g., ranging from −1.0 to 1.0, or −0.5 to 0.5).
• Each unit has a bias associated with it, as explained later.
• The biases are similarly initialized to small random numbers.
• Each training tuple, $X$, is processed by the following steps.
Propagate the inputs forward:
• First, the training tuple is fed to the network’s input layer.
• The inputs pass through the input units, unchanged.
• That is, for an input unit $j$, its output, $O_j$, is equal to its input value, $I_j$.
• Next, the net input and output of each unit in the hidden and output layers are computed.
• The net input to a unit in the hidden or output layers is computed as a linear combination of its inputs.
The steps of backpropagation
• Propagate the inputs forward:
  ◦ Each hidden layer or output layer unit has a number of inputs to it that are, in fact, the outputs of the units connected to it in the previous layer.
Propagate the inputs forward
• To compute the net input to the unit, each input connected to the unit is multiplied by its corresponding weight, and the products are summed.
• Given a unit $j$ in a hidden or output layer, the net input, $I_j$, to unit $j$ is

$$I_j = \sum_i w_{ij} O_i + \theta_j$$

• where $w_{ij}$ is the weight of the connection from unit $i$ in the previous layer to unit $j$; $O_i$ is the output of unit $i$ from the previous layer; and $\theta_j$ is the bias of unit $j$.
• The bias acts as a threshold in that it serves to vary the activity of the unit.
Propagate the inputs forward
• Each unit in the hidden and output layers takes its net input and then applies an activation function to it.
• The function symbolizes the activation of the neuron represented by the unit.
• The logistic, or sigmoid, function is used. Given the net input $I_j$ to unit $j$, then $O_j$, the output of unit $j$, is computed as

$$O_j = \frac{1}{1 + e^{-I_j}}$$

• The logistic function is nonlinear and differentiable, allowing the backpropagation algorithm to model classification problems that are linearly inseparable.
Propagate the inputs forward
• We compute the output values, $O_j$, for each hidden layer, up to and including the output layer, which gives the network’s prediction.
• In practice, it is a good idea to cache (i.e., save) the intermediate output values at each unit, as they are required again later when backpropagating the error.
• This trick can substantially reduce the amount of computation required.
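The forward step with caching can be sketched as below: the net input of each unit is a weighted sum of the previous layer's outputs plus the unit's bias, the output is the logistic function of that net input, and every layer's outputs are kept for the backward pass. The 3-2-1 network, its weights and biases, and the input tuple are illustrative values only.

```python
import math


def forward_with_cache(x, layers):
    """Return the outputs O_j of every layer, input layer first."""
    outputs = [list(x)]                      # input units pass values unchanged
    for weights, biases in layers:
        prev = outputs[-1]
        layer_out = []
        for w_j, theta_j in zip(weights, biases):
            # Net input I_j: weighted sum of previous outputs plus the bias.
            net_j = sum(w_ij * o_i for w_ij, o_i in zip(w_j, prev)) + theta_j
            layer_out.append(1.0 / (1.0 + math.exp(-net_j)))
        outputs.append(layer_out)            # cached for the backward pass
    return outputs


# Hypothetical 3-2-1 network; all numbers are illustrative.
layers = [
    ([[0.2, 0.4, -0.5], [-0.3, 0.1, 0.2]], [-0.4, 0.2]),   # hidden layer
    ([[-0.3, -0.2]], [0.1]),                               # output layer
]
cache = forward_with_cache([1, 0, 1], layers)
print([round(o, 3) for o in cache[1]], round(cache[2][0], 3))
# [0.332, 0.525] 0.474
```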
Backpropagate the error
• The error is propagated backward by updating the weights and biases to reflect the error of the network’s prediction. For a unit $j$ in the output layer, the error $\mathit{Err}_j$ is computed by

$$\mathit{Err}_j = O_j (1 - O_j)(T_j - O_j)$$

• where $O_j$ is the actual output of unit $j$, and $T_j$ is the known target value of the given training tuple.
• Note that $O_j(1 - O_j)$ is the derivative of the logistic function.
Backpropagate the error
• To compute the error of a hidden layer unit $j$, the weighted sum of the errors of the units connected to unit $j$ in the next layer is considered.
• The error of a hidden layer unit $j$ is

$$\mathit{Err}_j = O_j (1 - O_j) \sum_k \mathit{Err}_k\, w_{jk}$$

• where $w_{jk}$ is the weight of the connection from unit $j$ to a unit $k$ in the next higher layer, and $\mathit{Err}_k$ is the error of unit $k$.
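The two error computations can be sketched as below, assuming logistic units throughout, so that the factor O_j(1 - O_j) is the activation derivative. The output values, target, and weights are illustrative numbers rather than values from the lecture.

```python
def output_error(o_j, t_j):
    """Err_j for an output unit: O_j (1 - O_j) (T_j - O_j)."""
    return o_j * (1.0 - o_j) * (t_j - o_j)


def hidden_error(o_j, next_errors, weights_to_next):
    """Err_j for a hidden unit: O_j (1 - O_j) * sum_k Err_k w_jk."""
    back = sum(err_k * w_jk for err_k, w_jk in zip(next_errors, weights_to_next))
    return o_j * (1.0 - o_j) * back


# Illustrative values: one output unit fed by two hidden units.
err_out = output_error(o_j=0.474, t_j=1.0)
err_h1 = hidden_error(o_j=0.332, next_errors=[err_out], weights_to_next=[-0.3])
err_h2 = hidden_error(o_j=0.525, next_errors=[err_out], weights_to_next=[-0.2])
print(round(err_out, 4), round(err_h1, 4), round(err_h2, 4))
# 0.1311 -0.0087 -0.0065
```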
Backpropagate the error
• The weights and biases are updated to reflect the propagated errors.
• Weights are updated by the following equations, where $\Delta w_{ij}$ is the change in weight $w_{ij}$:

$$\Delta w_{ij} = (l)\, \mathit{Err}_j\, O_i$$
$$w_{ij} = w_{ij} + \Delta w_{ij}$$

• The variable $l$ is the learning rate, a constant typically having a value between 0.0 and 1.0.
• The learning rate helps avoid getting stuck at a local minimum in decision space. If the learning rate is too small, then learning will occur at a very slow pace. If the learning rate is too large, then oscillation between inadequate solutions may occur.
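The effect of the learning rate can be seen on a toy problem. The sketch below runs gradient descent on the one-dimensional error E(w) = w², which stands in for a network's error surface; the three rates and the starting point are arbitrary choices for illustration.

```python
def descend(rate, steps=10, w=1.0):
    """Gradient descent on the toy error E(w) = w**2, whose gradient is 2w."""
    path = [w]
    for _ in range(steps):
        w = w - rate * 2 * w
        path.append(w)
    return path


slow = descend(rate=0.01)          # barely moves toward the minimum at 0
good = descend(rate=0.3)           # approaches the minimum quickly
oscillating = descend(rate=0.95)   # overshoots and flips sign every step
print(slow[-1], good[-1], oscillating[:4])
```

With the largest rate, each step jumps across the minimum to the other side, which is the oscillation the slide refers to.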
Backpropagate the error
• Biases are updated by the following equations, where $\Delta \theta_j$ is the change in bias $\theta_j$:

$$\Delta \theta_j = (l)\, \mathit{Err}_j$$
$$\theta_j = \theta_j + \Delta \theta_j$$

• Updating the weights and biases after the presentation of each tuple is referred to as case updating.
• Alternatively, the weight and bias increments could be accumulated in variables, so that the weights and biases are updated after all the tuples in the training set have been presented (called epoch updating).
• Batch/mini-batch updating: the weights and biases are updated after several samples.
• One iteration through the training set is an epoch.
Terminating condition
• Training stops when:
  ◦ All $\Delta w_{ij}$ in the previous epoch are so small as to be below some specified threshold, or
  ◦ The percentage of tuples misclassified in the previous epoch is below some threshold, or
  ◦ A pre-specified number of epochs has expired.
• In practice, several hundreds of thousands of epochs may be required before the weights will converge.
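Putting the steps together, the following is a minimal sketch of case-updating backpropagation with the three terminating conditions above. It assumes one hidden layer, one logistic output unit, and the standard logistic-unit error and update formulas; the function name, hyperparameters, thresholds, seed, and the OR training set are all assumptions made for this example.

```python
import math
import random


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, n_hidden=2, rate=0.5, max_epochs=5000,
          delta_threshold=1e-5, error_threshold=0.01, seed=0):
    """Case-updating backpropagation: one hidden layer, one output unit."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Small random initial weights and biases in [-0.5, 0.5].
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b_h = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b_o = rng.uniform(-0.5, 0.5)

    def predict(x):
        hidden = [logistic(sum(w * xi for w, xi in zip(w_j, x)) + b)
                  for w_j, b in zip(w_h, b_h)]
        return hidden, logistic(sum(w * h for w, h in zip(w_o, hidden)) + b_o)

    for epoch in range(1, max_epochs + 1):
        max_delta, wrong = 0.0, 0
        for x, target in samples:
            hidden, out = predict(x)              # forward pass, outputs cached
            wrong += int(round(out) != target)
            err_o = out * (1 - out) * (target - out)
            # Hidden errors use the output weights from before this update.
            err_h = [h * (1 - h) * err_o * w for h, w in zip(hidden, w_o)]
            for j in range(n_hidden):
                delta = rate * err_o * hidden[j]
                w_o[j] += delta
                max_delta = max(max_delta, abs(delta))
                for i in range(n_in):
                    delta = rate * err_h[j] * x[i]
                    w_h[j][i] += delta
                    max_delta = max(max_delta, abs(delta))
                b_h[j] += rate * err_h[j]
            b_o += rate * err_o
        # Terminating conditions, checked once per epoch.
        if max_delta < delta_threshold:
            return predict, epoch, "weight changes below threshold"
        if wrong / len(samples) < error_threshold:
            return predict, epoch, "misclassification rate below threshold"
    return predict, max_epochs, "epoch limit reached"


or_samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
predict, epochs_used, reason = train(or_samples)
outputs = [predict(x)[1] for x, _ in or_samples]
print(epochs_used, reason, [round(o, 2) for o in outputs])
```

Switching to epoch updating would mean accumulating the increments inside the inner loop and applying them once after all tuples have been presented.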
