Error Back Propagation Algorithm
The lack of suitable training methods for multilayer perceptrons (MLPs) led to a waning of
interest in neural networks in the 1960s and 1970s. This changed with the reformulation of the
backpropagation training method for MLPs in the mid-1980s by Rumelhart et al.
Backpropagation was created by generalizing the Widrow-Hoff learning rule to multiple-
layer networks and nonlinear differentiable transfer functions. Standard
backpropagation is a gradient descent algorithm, as is the Widrow-Hoff learning rule, in
which the network weights are moved along the negative of the gradient of the
performance function. The term backpropagation refers to the manner in which the
gradient is computed for nonlinear multilayer networks.
As in simple cases of the delta learning rule training studied before, input patterns are
submitted during the back-propagation training sequentially. If a pattern is submitted
and its classification or association is determined to be erroneous, the synaptic weights
as well as the thresholds are adjusted so that the current least mean square classification
error is reduced. The input/output mapping, comparison of target and actual values,
and adjustment, if needed, continue until all mapping examples from the training set are
learned within an acceptable overall error. Usually, mapping error is cumulative and
computed over the full training set.
During the association or classification phase, the trained neural network itself operates
in a feedforward manner. However, the weight adjustments enforced by the learning
rules propagate exactly backward from the output layer through the so-called "hidden
layers" toward the input layer.
The input and output values of the network are denoted y_j and o_k, respectively. We thus
denote y_j, for j = 1, 2, ..., J, and o_k, for k = 1, 2, ..., K, as the signal values at the j'th
column of nodes and the k'th column of nodes, respectively. As before, the weight w_kj
connects the output of the j'th neuron with the input to the k'th neuron.
Eqn. 1
The error expression, generalized to include all squared errors at the outputs k = 1, 2, ..., K, is
Eqn:2
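As a minimal sketch of the two quantities just introduced, the following Python computes the outputs of a single layer and the squared error summed over the K outputs. The equations themselves are not reproduced in the text, so the forms o_k = f(sum_j w_kj y_j) and E = 1/2 sum_k (d_k - o_k)^2, the choice of a bipolar sigmoid for f, and all function names and sample values are assumptions made for illustration.

```python
import math

def bipolar_sigmoid(net):
    """Bipolar continuous activation, assumed here with steepness 1."""
    return 2.0 / (1.0 + math.exp(-net)) - 1.0

def layer_output(weights, y):
    """Return o_k = f(sum_j w_kj * y_j) for every output neuron k."""
    return [bipolar_sigmoid(sum(w_kj * y_j for w_kj, y_j in zip(row, y)))
            for row in weights]

def squared_error(d, o):
    """Return E = 1/2 * sum_k (d_k - o_k)^2 for one pattern."""
    return 0.5 * sum((d_k - o_k) ** 2 for d_k, o_k in zip(d, o))

# Hypothetical example: J = 3 inputs, K = 2 outputs.
W = [[0.1, -0.2, 0.3],
     [0.0, 0.4, -0.1]]
y = [1.0, 0.5, -1.0]
d = [1.0, -1.0]   # desired outputs
o = layer_output(W, y)
E = squared_error(d, o)
```

Row k of W holds the weights w_kj feeding output neuron k, which matches the indexing convention used above.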
3. Fixed input of value during both the training and recall phases
Minimization of error requires the weight changes to be in the negative gradient
direction. Individual weight adjustments are computed as follows
Eqn:3
Eqn:6
Eqn:7
Since
Eqn:8
Eqn:9
Eqn: 10
Eqn 10 represents the general formula for delta training/learning weight adjustments for
a single-layer network. It also follows that the adjustment of weight w_kj is proportional
to the input activation y_j and to the error signal value at the k'th neuron's output.
The delta value needs to be explicitly computed for specifically chosen activation
functions.
Eqn: 11
Eqn: 12
Denoting the second term in the above equation as the derivative of the activation function
Eqn: 13
And
Eqn: 14
Rewriting Eqn 12, we have
Eqn: 15
Eqn 15 shows that the error signal term depicts the local error (d_k - o_k) at the output of the
k'th neuron scaled by the multiplicative factor f'_k(net_k).
The final formula for the weight adjustment of the single-layer network can be obtained
from Eqn 10 as
Eqn: 16
Eqn 16 is identical to the delta training rule. The updated weight values become
Eqn: 17
Eqn: 18
Eqn: 19
or
Eqn: 20
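The numbered equations of this derivation are not reproduced in the text. As a sketch only, the chain of steps described above can be written out under the usual assumptions: net_k = sum_j w_kj y_j, o_k = f_k(net_k), E = 1/2 sum_k (d_k - o_k)^2, a learning constant written here as η, and an error signal δ_ok defined as the negative derivative of E with respect to net_k. The symbols η and δ_ok and the grouping of steps are assumptions and may not match the original equation numbering.

```latex
\begin{align*}
\Delta w_{kj} &= -\eta \frac{\partial E}{\partial w_{kj}}, \qquad
\frac{\partial E}{\partial w_{kj}}
  = \frac{\partial E}{\partial net_k}\,\frac{\partial net_k}{\partial w_{kj}}
  = -\delta_{ok}\, y_j
  \quad\text{since}\quad \frac{\partial net_k}{\partial w_{kj}} = y_j \\
\Delta w_{kj} &= \eta\, \delta_{ok}\, y_j \\
\delta_{ok} &= -\frac{\partial E}{\partial o_k}\,\frac{\partial o_k}{\partial net_k}
  = (d_k - o_k)\, f_k'(net_k) \\
\Delta w_{kj} &= \eta\,(d_k - o_k)\, f_k'(net_k)\, y_j \\
w_{kj}' &= w_{kj} + \eta\,(d_k - o_k)\, f_k'(net_k)\, y_j
\end{align*}
```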
For the bipolar continuous case, the activation function is given by
We obtain
Verification of identity
Letting o=f(net)
LHS=RHS
The delta value for a bipolar continuous activation function is given by
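The formulas for this verification are not reproduced in the text. Assuming the standard bipolar continuous activation f(net) = 2 / (1 + exp(-net)) - 1 with steepness 1, the identity usually verified at this point is f'(net) = 1/2 (1 - o^2) with o = f(net), which gives the error signal 1/2 (d_k - o_k)(1 - o_k^2). The sketch below checks that identity numerically; the function names are hypothetical.

```python
import math

def f(net):
    """Bipolar continuous activation with steepness 1 (assumed form)."""
    return 2.0 / (1.0 + math.exp(-net)) - 1.0

def f_prime_direct(net):
    """Derivative obtained by differentiating f directly."""
    e = math.exp(-net)
    return 2.0 * e / (1.0 + e) ** 2

def f_prime_from_output(o):
    """Same derivative written in terms of the output o = f(net)."""
    return 0.5 * (1.0 - o * o)

def delta_bipolar(d, o):
    """Error signal (d - o) * f'(net) for the bipolar continuous case."""
    return 0.5 * (d - o) * (1.0 - o * o)

# Compare both sides of the identity on a few sample activations.
for net in (-3.0, -0.5, 0.0, 0.7, 2.5):
    lhs = f_prime_direct(net)
    rhs = f_prime_from_output(f(net))
    assert abs(lhs - rhs) < 1e-12
```

Writing the derivative in terms of o is convenient in practice because the output is already available after the forward pass, so no extra exponential has to be evaluated.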
The updated weights under the delta training rule for the single-layer network can be
expressed using the vector notation
where the error signal δo is defined as a column vector consisting of the individual error
signal terms
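The vector form of the update can be sketched in Python as below. The sketch assumes the update adds the outer product of the error signal vector and the input vector, scaled by a learning constant (called eta here), to the weight matrix, and it assumes the bipolar continuous activation. The names, sample values, and the use of a single repeated training pattern are illustrative assumptions.

```python
import math

def bipolar_sigmoid(net):
    """Bipolar continuous activation, assumed with steepness 1."""
    return 2.0 / (1.0 + math.exp(-net)) - 1.0

def delta_rule_step(W, y, d, eta=0.1):
    """One delta-rule step for a single layer: W <- W + eta * delta_o * y^T."""
    o = [bipolar_sigmoid(sum(w * y_j for w, y_j in zip(row, y))) for row in W]
    # Error signal vector: delta_ok = 1/2 * (d_k - o_k) * (1 - o_k^2).
    delta_o = [0.5 * (d_k - o_k) * (1.0 - o_k * o_k) for d_k, o_k in zip(d, o)]
    # Add the outer product delta_o * y^T, scaled by eta, to the weight matrix.
    W_new = [[w + eta * dk * y_j for w, y_j in zip(row, y)]
             for row, dk in zip(W, delta_o)]
    return W_new, o

def error(d, o):
    """Squared error 1/2 * sum_k (d_k - o_k)^2 for one pattern."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(d, o))

W = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
y = [1.0, 0.5, -1.0]
d = [1.0, -1.0]
errors = []
for _ in range(50):
    W, o = delta_rule_step(W, y, d, eta=0.5)
    errors.append(error(d, o))   # error measured before each update
```

Inspecting `errors` shows the error for this pattern shrinking as the step is repeated.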
When the weights are changed immediately after each training pattern is presented,
the procedure is called the incremental approach.
When the weights are changed only after all the training patterns have been presented, it is
called batch mode. This mode requires additional local storage for each connection to
hold the weight changes accumulated until they are applied.
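The difference between the two modes can be sketched as follows. For brevity the sketch assumes a single linear neuron trained with the Widrow-Hoff rule rather than a full network; the training set, learning constant, and function names are hypothetical.

```python
def incremental_epoch(w, patterns, eta):
    """Update the weights immediately after each pattern is presented."""
    w = list(w)
    for y, d in patterns:
        o = sum(w_j * y_j for w_j, y_j in zip(w, y))
        w = [w_j + eta * (d - o) * y_j for w_j, y_j in zip(w, y)]
    return w

def batch_epoch(w, patterns, eta):
    """Accumulate the changes over all patterns, then apply them once."""
    accumulated = [0.0] * len(w)  # extra storage, one slot per connection
    for y, d in patterns:
        o = sum(w_j * y_j for w_j, y_j in zip(w, y))
        accumulated = [a + eta * (d - o) * y_j for a, y_j in zip(accumulated, y)]
    return [w_j + a for w_j, a in zip(w, accumulated)]

# Hypothetical training set generated by the target weights (2, -1).
patterns = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
w_inc = [0.0, 0.0]
w_bat = [0.0, 0.0]
for _ in range(200):
    w_inc = incremental_epoch(w_inc, patterns, eta=0.1)
    w_bat = batch_epoch(w_bat, patterns, eta=0.1)
```

The `accumulated` list is the additional per-connection storage that batch mode needs. Both modes reach essentially the same weights on this small problem, although the individual steps differ.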
However, there are two such troughs, at w_l1 and w_l2. If learning commences at point 2, we
may end up in a local minimum instead of the global minimum w_g. The trained network
will then be unable to produce the desired performance in terms of its acceptable terminal
error. To ensure convergence to a satisfactory minimum, the starting point should be
changed to point 1.
The problem of local minima can, however, often be avoided by introducing some form of
randomness into the training.
The weights of the network to be trained are typically initialized at small random values.
The initialization strongly affects the ultimate solution.
• If all weights start out with equal values, and if the solution requires that
unequal weights be developed, the network may not train properly.
• The weights should not be very large, because the sigmoidal activation function may
saturate from the very beginning and the system may become stuck at a local
minimum or on a very flat plateau near the starting point.
• One method of choosing the weight w_ij is to draw it from the range
(-3/√o_i, 3/√o_i), where o_i is the number of processing elements j that feed forward to
processing element i.
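The last heuristic can be sketched in Python as follows, assuming the range is read as plus or minus 3 divided by the square root of the number of feeding elements and that the weights are drawn uniformly from it. The function name, the layer sizes, and the fixed seed are hypothetical.

```python
import math
import random

def init_weights(fan_in, n_neurons, rng=None):
    """Draw each weight uniformly from (-3/sqrt(fan_in), 3/sqrt(fan_in))."""
    rng = rng or random.Random()
    bound = 3.0 / math.sqrt(fan_in)
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(n_neurons)]

# Hypothetical layer: 9 inputs feeding 4 neurons, so the bound is 3/sqrt(9) = 1.
W = init_weights(fan_in=9, n_neurons=4, rng=random.Random(0))
```

Scaling the range by the number of feeding elements keeps the initial net input of each neuron moderate, which helps avoid starting in the saturated region of the sigmoid.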
Since the weights are adjusted in proportion to f'(net), the weights connected to
neurons operating in the midrange of the activation function are changed the most. Since the error signals are
computed with f'(net) as a multiplier, the back-propagated errors are large only for
those neurons which are in the steep thresholding mode.
The other feature apparent from the graph is that, for a fixed learning constant,
all weight adjustments are in proportion to the steepness coefficient λ. This observation
leads to the conclusion that using activation functions with larger values of λ may yield
results similar to those obtained with a larger learning constant. It is therefore advisable to keep λ fixed at 1 and
to control only the learning constant, rather than controlling both.
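Both observations can be checked numerically. The sketch below assumes the bipolar continuous activation with steepness λ, f(net) = 2 / (1 + exp(-λ net)) - 1, whose derivative is λ/2 (1 - o^2). The function names and sample points are hypothetical.

```python
import math

def f(net, lam=1.0):
    """Bipolar continuous activation with steepness coefficient lam."""
    return 2.0 / (1.0 + math.exp(-lam * net)) - 1.0

def f_prime(net, lam=1.0):
    """Derivative of f with respect to net: lam/2 * (1 - o^2)."""
    o = f(net, lam)
    return 0.5 * lam * (1.0 - o * o)

# The slope peaks at net = 0 and almost vanishes for saturated neurons.
slopes = {net: f_prime(net) for net in (-6.0, -2.0, 0.0, 2.0, 6.0)}

# With the same learning constant, doubling lam doubles the peak slope,
# and therefore the weight adjustment for a midrange neuron.
peak_ratio = f_prime(0.0, lam=2.0) / f_prime(0.0, lam=1.0)
```

Because the peak slope scales linearly with λ, changing λ acts on midrange neurons much like rescaling the learning constant, which is the reason for fixing λ and tuning only one of the two.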
The learning constant α affects the convergence of the BPA. A larger value of α speeds up the convergence but
might result in overshooting, while a smaller value of α avoids overshooting at the cost of
slower convergence. The learning constant should be chosen experimentally for each problem.
Learning constants ranging from 10^-3 to 10 have been reported throughout the
technical literature as successful for many computational back-propagation
experiments.
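The trade-off can be illustrated on a deliberately simple error surface. The sketch below assumes a one-dimensional error E(w) = 1/2 w^2 instead of a real network, so the specific values of α that work here say nothing about suitable values for an actual training problem.

```python
def descend(alpha, w0=1.0, steps=20):
    """Gradient descent on E(w) = 1/2 * w^2, whose gradient is w."""
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - alpha * w  # move along the negative gradient
        history.append(w)
    return history

slow = descend(alpha=0.01)       # small steps: stable but slow
fast = descend(alpha=0.5)        # larger steps: quicker convergence
overshoot = descend(alpha=1.8)   # jumps across the minimum at every step
diverge = descend(alpha=2.5)     # too large: the error grows
```

On this surface each step multiplies w by (1 - α), so the sign flips once α exceeds 1 and the iteration diverges once α exceeds 2.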
Based on the above observations some heuristics for improving the rate of
convergence are proposed.
Momentum Method
This method is used for accelerating the convergence of the EBPTA. It
involves supplementing the current weight adjustment with a fraction of the most recent
weight adjustment. This is usually done according to the formula
where t and t-1 represent the current and the most recent training step, respectively, and
a is a user-selected positive momentum constant. The second term is called the
momentum term. For N steps using the momentum method, the current weight change is
expressed as
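The momentum formulas are not reproduced in the text. A minimal sketch, assuming the common form in which the weight change at step t is the negative gradient scaled by a learning constant (eta here) plus a times the weight change at step t-1, is given below. The one-dimensional error E(w) = 1/2 (w - 3)^2, the parameter values, and the function names are illustrative assumptions.

```python
def momentum_descent(grad, w0, eta, a, steps):
    """Gradient descent with a momentum term a * (previous weight change)."""
    w, prev_dw = w0, 0.0
    grads, changes = [], []
    for _ in range(steps):
        g = grad(w)
        dw = -eta * g + a * prev_dw   # current step plus momentum term
        w += dw
        prev_dw = dw
        grads.append(g)
        changes.append(dw)
    return w, grads, changes

def grad(w):
    """Gradient of the illustrative error E(w) = 1/2 * (w - 3)^2."""
    return w - 3.0

eta, a = 0.1, 0.5
w_final, grads, changes = momentum_descent(grad, w0=0.0, eta=eta, a=a, steps=30)
w_plain, _, _ = momentum_descent(grad, w0=0.0, eta=eta, a=0.0, steps=30)

# The last change equals the discounted sum of all past gradients.
unrolled = -eta * sum(a ** n * g for n, g in enumerate(reversed(grads)))
```

Unrolling the recursion shows that each weight change is a sum of past gradients weighted by powers of a, which is why consistent gradient directions build up speed. In this example the run with momentum ends closer to the minimum at w = 3 than the run with a = 0.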
Starting from the simple case of a single hidden layer, the number of input nodes is
determined by the dimension (size) of the input vector to be classified, generalized, or
associated with a certain output quantity.
For planar images, the size of the input vector is sometimes made equal to the total number of
pixels in the evaluated images.
The choice of the number of output neurons depends on the type of
neural processing. In the case of an auto-associator, which associates a distorted input
vector with its undistorted class prototype, we have I = K.
In the case of a classifier, the number of output neurons is equal to the number of
classes.
The number of hidden neurons depends on the dimension n of the input vector and on
the number of separable regions in the n-dimensional input space.