Lecture Notes Ling2019
Steve Ling
School of Biomedical Engineering
Faculty of Engineering and Information Technology
University of Technology, Sydney
[Cover figure: fuzzy membership functions (VL, L, M, H, VH) defined over the normalised heart rate and the normalised corrected QT interval]
UTS
March 2019
CONTENTS
3.2.5 Defuzzification
5.3 Generalised Delta Learning Rule (Error Back Propagation Training)
7. GENETIC ALGORITHMS
7.1 Introduction to Genetic Algorithm
CHAPTER ONE
INTRODUCTION TO NEURAL NETWORKS AND FUZZY
LOGIC
_______________________________________________________
Neural networks and fuzzy systems estimate functions from sample data. Deterministic and
statistical approaches also estimate functions, however they require mathematical models.
Neural networks and fuzzy systems are model-free estimators as they do not require the
development of system models such as transfer functions and state-space representations.
The operational framework of neural networks and fuzzy systems is numerical rather than symbolic.
Neural networks theory has its structure embedded in the mathematical fields of
dynamical systems, optimal and adaptive control, and statistics. Fuzzy theory
encompasses these fields and others such as probability, mathematical logic, and
nonlinear control. Applications of neural networks include high-speed modems, long
distance telephone calls, airport bomb detectors, medical imaging, biomedical signal
classification systems and handwritten character and speech recognition systems.
Applications of fuzzy systems include subway systems, elevator and traffic light scheduling
systems. Fuzzy systems are also used in camcorder auto-focusing, smart home systems, and
biomedical instrumentation, and to provide smart control of household appliances such as air
conditioners, washing machines, vacuum cleaners, and refrigerators.
McCulloch and Pitts outlined the first formal model of an elementary computing neuron in
1943. The connections between neurons in a network fundamentally determine the dynamics
of the network. For this reason, the field known today as Neural Networks was originally
called Connectionism. Networks of this type seemed appropriate for modelling not only
symbolic logic, but also perception and behaviour.
In 1998, Yann LeCun and colleagues introduced LeNet-5, a seminal convolutional neural network.
Image features are distributed across the entire image, and convolutions with learnable
parameters are an effective way to extract similar features at multiple locations with few
parameters. In 2014, Christian Szegedy at Google began a quest to reduce the computational
burden of deep neural networks and devised GoogLeNet, the first Inception architecture.
During the past several years, fuzzy logic control has emerged as one of the most active and
fruitful areas for research in the application of fuzzy set theory. Motivated by Zadeh's seminal
papers on the linguistic approach and system analysis based on the theory of fuzzy sets,
Mamdani and his colleagues pioneered the use of fuzzy logic control. Recent applications
have shown effective control using fuzzy logic can be designed for complex ill-defined control
systems without the knowledge of their underlying dynamics.
The important milestones in the development of fuzzy logic control may be summarised in
Table 1.2.
Back-propagation provides a way of using a target function to find the coefficients which make a
certain mapping function approximate the target function as closely as possible. The mapping
function in back-propagation is complex. It can be visualised as the computation carried
out by a fully connected three-layer feedforward network. The network consists of three
layers: the input, hidden, and output layers as shown in Figure 1.1.
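As an illustrative sketch only (the layer sizes, learning rate, and toy data below are assumptions, not values from these notes), the following Python code implements such a fully connected three-layer network and trains it with gradient-descent back-propagation:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Toy data (assumed for illustration): 4 samples, 2 inputs, 1 target each (XOR).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(2, 4))   # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(4, 1))   # hidden -> output weights
eta = 0.5                                  # learning rate

for epoch in range(5000):
    # Forward pass through the three layers.
    H = sigmoid(X @ W1)          # hidden layer activations
    Z = sigmoid(H @ W2)          # output layer activations

    # Backward pass: propagate the output error toward the input layer.
    delta_out = (D - Z) * Z * (1 - Z)              # delta rule at the output
    delta_hid = (delta_out @ W2.T) * H * (1 - H)   # error back-propagated to hidden layer

    W2 += eta * H.T @ delta_out
    W1 += eta * X.T @ delta_hid

print(np.round(Z, 2))  # outputs approach the targets [0, 1, 1, 0]
```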
Neural networks used for identification purposes typically have multilayer feedforward
architectures and are trained using the error back-propagation technique. The basic configuration
for forward plant identification is shown in Figure 1.2. The identification of the plant inverse is
another viable alternative for designing control systems. A neural network configuration using
inverse plant identification is shown in Figure 1.3.
Figure 1.4 shows the feedforward controller implemented using a neural network. Neurocontroller
B is an exact copy of neural network A, which undergoes training. Network A is connected so that
it gradually learns to perform as the unknown plant inverse. A closely related control architecture
for control and simultaneous specialised learning of the output domain is shown in Figure 1.5.
A fuzzy logic controller (FLC) can be typically incorporated in a closed loop control system as
shown in Figure 1.6. The main elements of the FLC are a fuzzification unit, an inference engine
with a knowledge base, and a defuzzification unit.
The Self-Organising Fuzzy Logic Controller (SOFLC) shown in Figure 1.7 has a control
policy which can change with respect to the process it is controlling and the environment it is
operating in. The particular feature of this controller is that it strives to improve its performance
until it converges to a predetermined quality.
▪ The strategy for pattern learning or training (Hebbian learning, delta learning rule, etc)
The neuron is the fundamental building block of the biological network. Its schematic diagram
is shown in Figure 1.9. A typical cell has three major regions: the cell body (soma), the axon,
and the dendrites. Dendrites form a dendritic tree, which is a very fine bush of fibers around
the neuron's body. Dendrites receive information from neurons through long fibres called axons.
An axon is a long cylindrical connection that carries impulses from the neuron. The axon-
dendrite contact organ is called a synapse. The synapse is where the neuron introduces its signal
to the neighbouring neuron.
The neuron is able to respond to the total of its inputs aggregated within a short time interval
called the period of latent summation. The neuron's response is generated if the total potential
of its membrane reaches a certain level. Incoming impulses can be excitatory if they cause the
firing of a neuron, or inhibitory if they hinder the firing of a response. A more precise condition
for firing is that the excitation should exceed the inhibition by the amount called the threshold
of the neuron, typically a value of about 40 mV. After carrying a pulse, an axon fibre is in a
state of complete nonexcitability for a certain time called the refractory period. The time
units for modelling biological neurons can be taken to be of the order of a millisecond. However,
the refractory period is not uniform over the cells.
The typical cycle time of neurons is about a million times slower than semiconductor gates.
Nevertheless, the brain can do very fast processing for tasks like vision, motor control, and
decisions even with access to incomplete and noisy data. This is obviously possible only
because billions of neurons operate simultaneously in parallel.
A basic neuron model is shown in Figure 1.10 (a) and its threshold T characteristic is shown
in Figure 1.10 (b). The firing rule for this model is defined as follows:
$$o = f(net) = \begin{cases} 1 & net \geq T \\ 0 & net < T \end{cases} \tag{1.1}$$
Consider the conditions necessary for the firing of a neuron. Incoming impulses can be
excitatory if they cause firing, or inhibitory if they hinder the firing of the response. Note
that wi = +1 for excitatory synapses and wi = −1 for inhibitory synapses.
There are two main types of neural networks, namely feed-forward networks and
recurrent/feedback networks. A summary of the architectures of neural networks is shown in
Figure 1.11.
Various feed-forward neural networks have been developed such as single-layer perceptron,
multilayer perceptron and radial basis function nets, etc. Feed-forward type networks receive
external signals and simply propagate these signals through all the layers to obtain the result
(output) of the neural network. There are no feedback connections to previous layers.
Supervised learning, where the neuron (or neural network) is provided with a data set
consisting of input vectors and a target (desired output) associated with each input vector. This
data set is referred to as the training set. The aim of supervised training is then to adjust the
weight values such that the error between the real output of the neuron and the target output is
minimized. Supervised learning algorithms include LVQ, the perceptron, back-propagation,
ARTMap, etc.
Unsupervised learning, where the aim is to discover patterns or features in the input data with
no assistance from an external source. Many unsupervised learning algorithms basically
perform a clustering of the training patterns. Unsupervised learning algorithms include SOM,
VQ, PCA, the Hebbian learning rule, etc.
In traditional set theory, an item is either a member of a set or it is not. This two-valued logic
has proved to be very effective in solving well-defined problems, which are characterised by
precise descriptions of the process being dealt with in quantitative form. However, there is a
class of problems which are typically complex or ill-defined in nature where the concepts are
no longer clearly true or false, but are more or less false or most likely true. Fuzzy set theory
emerged as one effective approach to dealing with these problems. Developed in 1965 by Lotfi
Zadeh, the theory of fuzzy sets was introduced as an extension to traditional set theory, and
the corresponding fuzzy logic was developed to manipulate the fuzzy sets.
Fuzzy sets are defined in a universe of discourse. For a given universe of discourse U, a fuzzy
set is determined by a membership function which maps members of U on to a membership
range in the interval [0,1]. Associated with a classical binary or crisp set is a characteristic
function which returns 1 if the element is a member of that set and 0 otherwise.
Example 1.1
In the universe of discourse U = {2,3,4,5,6,7} , the fuzzy subset F labelled ‘integer close to
4’ may be defined as
F = 0.33/2 + 0.66/3 + 1.0/4 + 0.66/5 + 0.33/6 + 0.0/7
The support set of a fuzzy set F is the crisp set of all points u in U such that μF(u) > 0. A
fuzzy set whose support is a single point in U is referred to as a fuzzy singleton. The support
set is said to be compact if it is a strict subset of the universe of discourse.
The membership for fuzzy sets can be defined numerically or as a function. A numerical
definition expresses the degree of membership function as a vector of numbers. A
functional definition defines the membership function in an analytic expression which
allows the membership grade for each element in the defined universe of discourse to be
calculated. The membership functions which are often used include the triangular function,
the trapezoid function and the Gaussian function, as illustrated in Figure 1.13.
$$\mu_F(u) = \begin{cases} 0 & u < a \\ (u-a)/(b-a) & a \le u \le b \\ (c-u)/(c-b) & b \le u \le c \\ 0 & u > c \end{cases} \tag{1.5}$$

$$\mu_F(u) = \begin{cases} 0 & u < a \\ (u-a)/(b-a) & a \le u \le b \\ 1 & b \le u \le c \\ (d-u)/(d-c) & c \le u \le d \\ 0 & u > d \end{cases} \tag{1.6}$$

$$\mu_F(u) = e^{-\frac{(u-c)^2}{2\sigma^2}} \tag{1.7}$$
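As a sketch, the three membership-function shapes of Equations (1.5)-(1.7) can be coded directly; Example 1.2 below corresponds to the triangular case with a = 30, b = 45, c = 60:

```python
import math

def triangular(u, a, b, c):
    """Triangular membership function, Equation (1.5)."""
    if u < a or u > c:
        return 0.0
    if u <= b:
        return (u - a) / (b - a)
    return (c - u) / (c - b)

def trapezoidal(u, a, b, c, d):
    """Trapezoidal membership function, Equation (1.6)."""
    if u < a or u > d:
        return 0.0
    if u < b:
        return (u - a) / (b - a)
    if u <= c:
        return 1.0
    return (d - u) / (d - c)

def gaussian(u, c, sigma):
    """Gaussian membership function, Equation (1.7)."""
    return math.exp(-((u - c) ** 2) / (2 * sigma ** 2))

# Example 1.2: triangular set with a = 30, b = 45, c = 60.
print(triangular(40, 30, 45, 60))   # (40 - 30)/15 = 0.667
```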
Example 1.2
$$\mu_F(u) = \begin{cases} 0 & u < 30 \\ (u-30)/15 & 30 \le u \le 45 \\ (60-u)/15 & 45 \le u \le 60 \\ 0 & u > 60 \end{cases}$$
Let A and B be two fuzzy sets in U with membership functions μA and μB respectively. Some
basic fuzzy set operations are summarised as follows:
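A minimal sketch, assuming the standard Zadeh definitions (union by max, intersection by min, complement by 1 − μ), of these operations on discrete fuzzy sets written as dictionaries of membership grades:

```python
def fuzzy_union(A, B):
    """Union: membership is the maximum of the two memberships."""
    return {u: max(A.get(u, 0.0), B.get(u, 0.0)) for u in set(A) | set(B)}

def fuzzy_intersection(A, B):
    """Intersection: membership is the minimum of the two memberships."""
    return {u: min(A.get(u, 0.0), B.get(u, 0.0)) for u in set(A) | set(B)}

def fuzzy_complement(A):
    """Complement: membership is 1 minus the original membership."""
    return {u: 1.0 - mu for u, mu in A.items()}

# Fuzzy set from Example 1.1: 'integer close to 4'.
F = {2: 0.33, 3: 0.66, 4: 1.0, 5: 0.66, 6: 0.33, 7: 0.0}
print(fuzzy_complement(F))
```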
Example 1.3
Then
Let f: U → V and define A to be a fuzzy set on the universe of discourse U. By the extension principle, the image of A under f is a fuzzy set f(A) on V with membership μf(A)(v) = max{μA(u) : u ∈ U, f(u) = v}.
Example 1.4
then
f(A) = 0.6/1 + 1/3 + 0.5/5
One powerful aspect of fuzzy sets is the ability to deal with linguistic quantifiers or “hedges”.
Hedges such as very, more or less, not very, plus, etc. correspond to modifications in the
membership function as illustrated in Figure 1.15. Table 1.3 shows some fuzzy set operators
which can be used to represent some standard hedges. Note that the operator definitions are not
unique and should be designed into appropriate forms before being used.
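As an illustrative sketch (the precise operator forms depend on Table 1.3), two commonly used hedge operators are concentration for 'very' and dilation for 'more or less':

```python
def very(A):
    """Concentration: 'very A' squares each membership grade."""
    return {u: mu ** 2 for u, mu in A.items()}

def more_or_less(A):
    """Dilation: 'more or less A' takes the square root of each grade."""
    return {u: mu ** 0.5 for u, mu in A.items()}

F = {2: 0.33, 3: 0.66, 4: 1.0, 5: 0.66, 6: 0.33}   # 'integer close to 4'
print(very(F))           # grades shrink: the set becomes more restrictive
print(more_or_less(F))   # grades grow: the set becomes more permissive
```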
References
1. Brown, M., Harris, C. 1994, Neuro fuzzy Adaptive Modelling and Control,
Prentice Hall, Hertfordshire.
2. Chai, R., Ling, S. H., San, P. P., Naik, G., Nguyen, N. T., Tran, Y., Craig, A., and
Nguyen, N. T. 2017, “Improving EEG-based driver fatigue classification using
sparse-deep belief networks,” Frontiers in Neuroscience, vol. 11, Article 103.
4. Ghevondian, N., Nguyen, H. T. 1997, ‘Using Fuzzy Logic Reasoning for Monitoring
Hypoglycaemia in Diabetic Patients’, 19th Annual International Conference, IEEE
Engineering in Medicine and Biology Society, 30 October – 2 November 1997,
Chicago, USA, pp. 1108-1111.
5. Jamshidi, M., Vadiee, N., Ross, T. J. 1993, Fuzzy Logic and Control - Software and
Hardware Applications, Prentice Hall, New Jersey.
7. Kosko, B. 1992, Neural Networks and Fuzzy Systems, Prentice Hall, New Jersey.
9. Ling, S. H., Leung, F. H. F., Lam, H. K., Tam, P. K. S. 2003, “Short-term electric load
forecasting based on a neural fuzzy network,” IEEE Trans. Industrial Electronics, vol.
50, no. 6, pp.1305–1316.
10. Ling, S. H., Iu, H. H. C., Leung, F. H. F., Chan K. Y. 2008, “Improved hybrid PSO-
based wavelet neural network for modelling the development of fluid dispensing for
electronic packaging,” IEEE Trans. Industrial Electronic, vol. 55, no. 9, pp. 3447–
3460, Sep. 2008.
11. Ling, S. H., Nguyen, H. T. 2011, “Genetic algorithm based multiple regression with
fuzzy inference system for detection of nocturnal hypoglycemic episodes,” IEEE Trans.
on Information Technology in Biomedicine, vol. 15, no. 2, pp. 308–315.
12. Ling, S. H., San, P. P., Chan K. Y., Leung, F. H. F., Liu, Y. 2014, “An intelligent
swarm based-wavelet neural network for affective mobile phone design,”
Neurocomputing, vol. 142, pp. 30-38.
13. Nguyen, H. T., Sands, D. M. 1995, ‘Self-Organising Fuzzy Logic Controller’, Control
95, The Institution of Engineers, Australia, 23-25 October 1995, Melbourne, vol 2, pp.
353-257.
14. Nguyen, H. T., King, L. M., Knight, G. 2004, ‘Real-Time Head Movement System
and Embedded Linux Implementation for the Control of Power Wheelchairs’, 26th
Annual International Conference of the IEEE Engineering in Medicine and Biology
Society, 1-5 September 2004, San Francisco, USA, pp. 4892-4895.
15. Nguyen, H. T., Nguyen, S. T., Taylor, P. B., Middleton J. 2007, ‘Head Direction
Command Classification using an Adaptive Optimal Bayesian Neural Network’,
International Journal of Factory Automation, Robotics and Soft Computing, Issue 3,
July 2007, pp. 98-103.
16. Smith, M. 1993, Neural Networks for Statistical Modelling. Van Nostrand Reinhold,
New York.
17. Ross T.J. 1995. Fuzzy Logic with Engineering Applications. McGraw-Hill.
18. Yan, J., Ryan, M., Power, J. 1994, Using Fuzzy Logic, Prentice Hall, Hertfordshire.
19. Zurada, J. M. 1992, Introduction to Artificial Neural Systems. West Publishing
Company, St. Paul.
CHAPTER TWO
FUNDAMENTAL CONCEPTS OF NEURAL NETWORKS
_______________________________________________________
The first formal definition of a synthetic neuron model was formulated by McCulloch and
Pitts (1943). The McCulloch-Pitts neuron model is shown in Figure 2.1. The firing rule for
this model is defined as follows:
$$z = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i \geq T \\ 0 & \text{if } \sum_{i=1}^{n} w_i x_i < T \end{cases} \qquad \text{or} \qquad z = \begin{cases} 1 & \text{if } \mathbf{w}'\mathbf{x} \geq T \\ 0 & \text{if } \mathbf{w}'\mathbf{x} < T \end{cases} \tag{2.1}$$
where w is the weight vector and x is the input vector,
$$\mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}, \qquad \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$
Note that wi = +1 for excitatory synapses, wi = −1 for inhibitory synapses for this model, and
T is the neuron's threshold value.
Although this neuron model is very simplistic, it has substantial computing potential. It can
perform the basic logic operations NOT, OR, and AND, provided its weights and thresholds
are properly selected.
Example 2.1
An example of a three-input NOR gate using the McCulloch-Pitts neuron model is shown in
Figure 2.2. Verify the implemented function by compiling a truth table for this logic gate.
Solution 2.1
x1  x2  x3  v1 = w'x  y   v2   z
 0   0   0     0      0    0   1
 0   0   1     1      1   −1   0
 0   1   0     1      1   −1   0
 0   1   1     2      1   −1   0
 1   0   0     1      1   −1   0
 1   0   1     2      1   −1   0
 1   1   0     2      1   −1   0
 1   1   1     3      1   −1   0
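A minimal sketch of this two-neuron NOR implementation, assuming an OR neuron with unit weights and threshold 1 feeding a NOT neuron with weight −1 and threshold 0 (the weights of Figure 2.2 are inferred from the truth table above):

```python
from itertools import product

def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts neuron: fires (1) when the weighted sum reaches the threshold."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net >= threshold else 0

# Three-input NOR built from an OR neuron followed by a NOT neuron.
for x1, x2, x3 in product([0, 1], repeat=3):
    y = mp_neuron((x1, x2, x3), weights=(1, 1, 1), threshold=1)  # OR stage
    z = mp_neuron((y,), weights=(-1,), threshold=0)              # NOT stage
    print(x1, x2, x3, y, z)
```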
Exercise 2.1
An example of a three-input NAND gate using the McCulloch-Pitts neuron model is shown in
Figure 2.3. Verify the implemented function by compiling a truth table for this logic gate.
2.1.2 Perceptrons
The McCulloch-Pitts model is based on several simplifications. It allows only binary states
(0,1) and operates under a discrete-time assumption with synchronisation of all neurons in a
larger network. Weights and thresholds in a neuron are fixed and no interaction among
network neurons takes place except for signal flow.
A general perceptron consists of a processing element with synaptic input connections and a
single output. Its symbolic representation described in Figure 2.4 shows a set of weights and
the neuron's processing unit (node). The neuron output signal is given by:
$$v = \sum_{i=1}^{n} w_i x_i = \mathbf{w}'\mathbf{x} \tag{2.2}$$
$$z = f(v) \tag{2.3}$$
The function z = f(v) is often referred to as an activation function. Note that temporarily,
the threshold value is not explicitly used for convenience. We have assumed that
the modelled neuron has (n-1) actual synaptic connections associated with actual
variable inputs x1 ,x2 ,...,xn-1. We have also assumed that the last synapse is an
inhibitory one with wn = −1.
$$z = f_0(v) = \mathrm{sgn}(v) = \begin{cases} +1, & v > 0 \\ -1, & v < 0 \end{cases} \tag{2.4}$$
Logistic function
$$f_1(v) = \frac{1}{1+e^{-v}} \tag{2.5}$$
$$z = f_1(v), \qquad \frac{dz}{dv} = z(1-z)$$
$$f_2(v) = \frac{2}{1+e^{-v}} - 1 = \frac{1-e^{-v}}{1+e^{-v}} = 2f_1(v) - 1 \tag{2.6}$$
$$z = f_2(v), \qquad \frac{dz}{dv} = 0.5(1-z^2)$$
$$f_3(v) = \tanh(v) = \frac{1-e^{-2v}}{1+e^{-2v}} = 2f_1(2v) - 1 \tag{2.7}$$
$$z = f_3(v), \qquad \frac{dz}{dv} = 1 - \tanh^2(v) = 1 - z^2$$
The soft-limiting activation functions f1(v), f2(v), f3(v) are often called sigmoidal
characteristics, as opposed to the hard-limiting activation function f0(v). A perceptron with
the activation function f0(v) describes the discrete perceptron shown in Figure 2.9. It was the
first learning machine, introduced by Rosenblatt in 1958.
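A short sketch of the activation functions f0-f3 and the derivatives quoted above (Equations (2.4)-(2.7)):

```python
import math

def f0(v):                      # hard limiter, Equation (2.4)
    return 1.0 if v > 0 else -1.0

def f1(v):                      # logistic function, Equation (2.5)
    return 1.0 / (1.0 + math.exp(-v))

def f2(v):                      # bipolar logistic, Equation (2.6)
    return 2.0 * f1(v) - 1.0

def f3(v):                      # hyperbolic tangent, Equation (2.7)
    return math.tanh(v)

def df1(v):                     # dz/dv = z(1 - z)
    z = f1(v)
    return z * (1.0 - z)

def df2(v):                     # dz/dv = 0.5(1 - z^2)
    z = f2(v)
    return 0.5 * (1.0 - z * z)

def df3(v):                     # dz/dv = 1 - z^2
    z = f3(v)
    return 1.0 - z * z

print(f1(0.5), df1(0.5), f2(0.5), df2(0.5))
```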
Example 2.2
Prove that if the logistic function z(v) = f1(v) is used as an activation function, then the
derivative of z(v) is given by:
$$\frac{dz}{dv} = z(1-z) \tag{2.8a}$$
Solution 2.2
$$z(v) = f_1(v) = \frac{1}{1+e^{-v}} = \frac{e^v}{e^v+1}$$
$$\frac{dz}{dv} = \frac{e^v(e^v+1) - e^v \cdot e^v}{(e^v+1)^2} = \frac{e^v}{e^v+1} - \left(\frac{e^v}{e^v+1}\right)^2 = z - z^2 = z(1-z)$$
Example 2.3
If the bipolar logistic function z(v) = f2(v) is used as an activation function, then the
derivative of z(v) is given by:
$$\frac{dz}{dv} = 0.5(1-z^2) \tag{2.8b}$$
Solution 2.3
$$z(v) = f_2(v) = \frac{2}{1+e^{-v}} - 1 = \frac{2e^v}{e^v+1} - 1 = \frac{e^v-1}{e^v+1}$$
$$\frac{dz}{dv} = \frac{e^v(e^v+1) - (e^v-1)e^v}{(e^v+1)^2} = \frac{2e^v}{(e^v+1)^2}$$
$$= \frac{2e^v}{e^v+1} \times \frac{1}{e^v+1} = (z+1)\frac{(1-z)}{2} = 0.5(1-z^2)$$
The neural network can be defined as an interconnection of neurons such that neuron
outputs are connected, through weights, to all other neurons including themselves with
both lag-free and delay connections allowed.
The mapping of the input vector x to the output vector z can be represented by
$$\mathbf{v} = \mathbf{W}\mathbf{x}, \qquad \mathbf{z} = \Gamma(\mathbf{v}) \tag{2.9}$$
$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad \mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_m \end{bmatrix}, \quad \mathbf{z} = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_m \end{bmatrix}$$
$$\mathbf{W} = \begin{bmatrix} w_{11} & \cdots & w_{1n} \\ \vdots & \ddots & \vdots \\ w_{m1} & \cdots & w_{mn} \end{bmatrix}, \qquad \Gamma(\mathbf{v}) = \begin{bmatrix} f(v_1) \\ f(v_2) \\ \vdots \\ f(v_m) \end{bmatrix}$$
where f(·) is an activation function.
A recurrent network can be obtained from the feedforward network by connecting the
outputs of the neurons to their inputs as shown in Figure 2.11.
In the above recurrent network, the time elapsed between t and t + Δ is introduced by
the delay elements in the feedback loop. This time delay is analogous to the refractory
period of an elementary biological neuron model.
The mapping of the input vector x to the output vector z can be represented by
$$\mathbf{v}(t+\Delta) = \mathbf{W}\mathbf{x}(t), \qquad \mathbf{z}(t+\Delta) = \Gamma[\mathbf{v}(t+\Delta)] \tag{2.10}$$
Recurrent networks typically operate with a discrete representation of data. They often
use neurons with a hard-limiting activation function. A system with discrete-time inputs
and a discrete data representation is called an automaton.
There are two different types of learning: supervised learning and unsupervised learning. In
supervised learning, the desired response d is provided by the trainer. The distance
ρ[d, z] between the actual and the desired response serves as an error measure and is used to
correct the network parameters. The error can be used to modify weights so that the error
decreases. For this learning mode, a set of input and output patterns (training set) is required.
The general learning rule for neural network studies is: the weight vector wi increases in
proportion to the product of the input x and the learning signal ri. The learning signal ri is in
general a function of wi, x, and sometimes of the training signal d.
ri = ri(wi, x, d) (2.12)
The increment of the weight vector according to the general learning rule is
$$\Delta\mathbf{w}_i = c\, r_i(\mathbf{w}_i, \mathbf{x}, d)\,\mathbf{x} \tag{2.13}$$
so that the weights at the next training step become
$$\mathbf{w}_i(k+1) = \mathbf{w}_i(k) + c\, r_i\big(\mathbf{w}_i(k), \mathbf{x}(k), d(k)\big)\,\mathbf{x}(k) \tag{2.14}$$
where c is a positive number called the learning constant, which determines the rate of learning.
The illustration for general weight learning rules is given in Figure 2.12.
The Hebbian learning rule (1949) represents a purely feedforward, unsupervised learning.
The learning signal r is simply equal to the output of the neuron, r = f(w'x) = z.
This learning rule requires the weight initialisation at small random values around wi = 0
prior to learning.
Exercise 2.2
Assume that the network shown in Figure 2.9 with the initial weight vector w(1) needs to be
trained using the set of three input vectors x(1), x(2), x(3) as below
$$\mathbf{x}(1) = \begin{bmatrix} 1 \\ -2 \\ 1.5 \\ 0 \end{bmatrix}, \quad \mathbf{x}(2) = \begin{bmatrix} 1 \\ -0.5 \\ -2 \\ -1.5 \end{bmatrix}, \quad \mathbf{x}(3) = \begin{bmatrix} 0 \\ 1 \\ -1 \\ 1.5 \end{bmatrix}, \quad \text{and} \quad \mathbf{w}(1) = \begin{bmatrix} 1 \\ -1 \\ 0 \\ 0.5 \end{bmatrix}$$
If the activation function of this perceptron is the logistic function f1(v), the learning
constant is c = 1, and the Hebbian learning rule is used, show that the weight vectors after
subsequent training steps are:
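A sketch (not the worked answer from the notes) that applies the Hebbian update Δw = c·f1(wᵀx)·x to the data of Exercise 2.2 and prints the weight vector after each step:

```python
import numpy as np

def f1(v):
    return 1.0 / (1.0 + np.exp(-v))    # logistic activation

c = 1.0                                 # learning constant
w = np.array([1.0, -1.0, 0.0, 0.5])     # initial weight vector w(1)
X = [np.array([1.0, -2.0, 1.5, 0.0]),   # x(1)
     np.array([1.0, -0.5, -2.0, -1.5]), # x(2)
     np.array([0.0, 1.0, -1.0, 1.5])]   # x(3)

for k, x in enumerate(X, start=1):
    r = f1(w @ x)        # Hebbian learning signal: the neuron output
    w = w + c * r * x    # weight update
    print(f"w({k + 1}) =", np.round(w, 4))
```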
The discrete perceptron learning rule (1958) is of supervised type as shown in Figure 2.13. The
learning signal r is the error between the desired and actual response of the neuron.
The weight adjustment is inherently zero when the desired and actual responses agree. The
weights are initialised at any values.
Exercise 2.3
Assume that the network shown in Figure 2.9 with the initial weight vector w(1) needs to be
trained using the set of three input vectors x(1), x(2), x(3) as below
$$\mathbf{x}(1) = \begin{bmatrix} 1 \\ -2 \\ 0 \\ -1 \end{bmatrix}, \quad \mathbf{x}(2) = \begin{bmatrix} 0 \\ 1.5 \\ -0.5 \\ -1 \end{bmatrix}, \quad \mathbf{x}(3) = \begin{bmatrix} -1 \\ 1 \\ 0.5 \\ -1 \end{bmatrix}, \quad \text{and} \quad \mathbf{w}(1) = \begin{bmatrix} 1 \\ -1 \\ 0 \\ 0.5 \end{bmatrix}$$
The trainer’s desired responses for x(1), x(2), x(3) are d(1)=−1, d(2)= −1, d(3)=1 respectively.
If the learning constant is c=0.1, and the discrete perceptron learning rule is used, show that
the weight vectors after subsequent training steps are:
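A similar sketch for Exercise 2.3, applying the discrete perceptron rule Δw = c(d − z)x with a bipolar sign activation (sgn(0) = +1 is an assumption):

```python
import numpy as np

def sgn(v):
    return 1.0 if v >= 0 else -1.0   # bipolar hard limiter (assumed sgn(0) = +1)

c = 0.1
w = np.array([1.0, -1.0, 0.0, 0.5])                 # w(1)
X = [np.array([1.0, -2.0, 0.0, -1.0]),              # x(1)
     np.array([0.0, 1.5, -0.5, -1.0]),              # x(2)
     np.array([-1.0, 1.0, 0.5, -1.0])]              # x(3)
d = [-1.0, -1.0, 1.0]                               # desired responses

for k, (x, dk) in enumerate(zip(X, d), start=1):
    z = sgn(w @ x)             # actual response
    w = w + c * (dk - z) * x   # perceptron (error-correction) update
    print(f"w({k + 1}) =", np.round(w, 4))
```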
The delta learning rule is only valid for continuous activation function and in the supervised
learning mode. This learning rule can be readily derived from the condition of least squared
error between the output and the desired response.
$$v_i = \mathbf{w}_i'\mathbf{x}$$
$$E = \frac{1}{2}e_i^2 = \frac{1}{2}(d_i - z_i)^2 = \frac{1}{2}\big(d_i - f(v_i)\big)^2 \tag{2.18}$$
$$\nabla E = \frac{\partial E}{\partial \mathbf{w}_i} = \frac{\partial E}{\partial z_i}\frac{\partial z_i}{\partial \mathbf{w}_i} = -(d_i - z_i)\frac{\partial f}{\partial \mathbf{w}_i} = -(d_i - z_i)\frac{\partial f}{\partial v_i}\frac{\partial v_i}{\partial \mathbf{w}_i} \tag{2.19}$$
$$\nabla E = -(d_i - z_i)\frac{\partial f}{\partial v_i}\,\mathbf{x}$$
Since the minimisation of the error requires the weight changes to be in the negative gradient
direction, we take
$$\Delta\mathbf{w}_i = -\eta\,\nabla E \tag{2.20}$$
where η is a positive constant (learning rate).
Using the general learning rule (2.14), it can be seen that the learning constant c and the
learning rate η are equivalent. The weights are initialised at any values, and the learning
signal r can be found from
$$r_i(k) = e_i(k)\frac{\partial f(v(k))}{\partial v(k)} = e_i(k)\,f'(v(k)) = \big(d_i(k) - z_i(k)\big)f'(v(k)) \tag{2.21}$$
Exercise 2.4
Again, assume that the network shown in Figure 2.9 with the initial weight vector w(1)
needs to be trained using the set of three input vectors x(1), x(2), x(3) as below
$$\mathbf{x}(1) = \begin{bmatrix} 1 \\ -2 \\ 0 \\ -1 \end{bmatrix}, \quad \mathbf{x}(2) = \begin{bmatrix} 0 \\ 1.5 \\ -0.5 \\ -1 \end{bmatrix}, \quad \mathbf{x}(3) = \begin{bmatrix} -1 \\ 1 \\ 0.5 \\ -1 \end{bmatrix}, \quad \text{and} \quad \mathbf{w}(1) = \begin{bmatrix} 1 \\ -1 \\ 0 \\ 0.5 \end{bmatrix}$$
The trainer’s desired responses for x(1), x(2), x(3) are d(1) = −1, d(2)= −1, d(3)=1
respectively. If the activation function of this perceptron is the bipolar logistic function f2(v),
the learning constant is c = 0.1, and the delta learning rule is used, show that the weight
vectors after subsequent training steps are:
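A sketch applying the delta rule of Equation (2.21), Δw = c(d − z)f'(v)x, with the bipolar logistic activation to the data of Exercise 2.4:

```python
import numpy as np

def f2(v):
    return 2.0 / (1.0 + np.exp(-v)) - 1.0   # bipolar logistic activation

c = 0.1
w = np.array([1.0, -1.0, 0.0, 0.5])                 # w(1)
X = [np.array([1.0, -2.0, 0.0, -1.0]),              # x(1)
     np.array([0.0, 1.5, -0.5, -1.0]),              # x(2)
     np.array([-1.0, 1.0, 0.5, -1.0])]              # x(3)
d = [-1.0, -1.0, 1.0]                               # desired responses

for k, (x, dk) in enumerate(zip(X, d), start=1):
    z = f2(w @ x)
    fprime = 0.5 * (1.0 - z ** 2)            # derivative of f2, Equation (2.8b)
    w = w + c * (dk - z) * fprime * x        # delta-rule update, Equation (2.21)
    print(f"w({k + 1}) =", np.round(w, 4))
```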
The Widrow-Hoff learning rule (1962) is applicable for the supervised training of neural
networks. It is independent of the activation function of neurons. The learning signal r is the
error between the desired output value d and the activation value of the neuron v.
This rule can be considered as a special case of the delta rule, assuming that
$v_i = \mathbf{w}'(k)\mathbf{x}(k)$, $f(v_i) = v_i$, and $f'(v_i) = 1$. This rule is sometimes called the LMS (least mean
square) learning rule. The weights are initialised at any value.
Exercise 2.5
Again, assume that the network shown in Figure 2.9 with the initial weight vector w(1)
needs to be trained using the set of three input vectors x(1), x(2), x(3) as below
$$\mathbf{x}(1) = \begin{bmatrix} 1 \\ -2 \\ 0 \\ -1 \end{bmatrix}, \quad \mathbf{x}(2) = \begin{bmatrix} 0 \\ 1.5 \\ -0.5 \\ -1 \end{bmatrix}, \quad \mathbf{x}(3) = \begin{bmatrix} -1 \\ 1 \\ 0.5 \\ -1 \end{bmatrix}, \quad \text{and} \quad \mathbf{w}(1) = \begin{bmatrix} 1 \\ -1 \\ 0 \\ 0.5 \end{bmatrix}$$
The trainer’s desired responses for x(1), x(2), x(3) are d(1) = −1, d(2)= −1, d(3)=1
respectively. If the activation function of this perceptron is the bipolar logistic function f2(v),
the learning constant is c = 0.1, and the Widrow-Hoff learning rule is used, show that the
weight vectors after subsequent training steps are:
References
Kosko B. 1992. Neural Networks and Fuzzy Systems. New Jersey: Prentice Hall.
Hertz J., Krogh A., Palmer R. G. 1991. Introduction to the Theory of Neural Computation.
Redwood City, California: Addison-Wesley.
Smith M. 1993. Neural Networks for Statistical Modelling. New York: Van Nostrand
Reinhold
Zurada, J.M. 1992. Introduction to Artificial Neural Systems. St. Paul: West
Publishing Company.
CHAPTER THREE
FUNDAMENTAL CONCEPTS OF FUZZY LOGIC AND
FUZZY CONTROLLER
_______________________________________________________
A fuzzy relation maps elements of one universe to one of another universe through the
Cartesian product of the two universes. The strength of the relation between ordered pairs of
the two universes is measured with the membership function expressing various degrees of
strength of the relation on the unit interval [0,1].
If A1, A2, ..., An are fuzzy sets in U1, U2, ..., Un respectively, the Cartesian product of A1,
A2, ..., An is a fuzzy set F = A1 × A2 × ... × An in the product space U1 × U2 × ... × Un with the membership function
$$\mu_F(u_1, u_2, \ldots, u_n) = \min\big[\mu_{A_1}(u_1), \mu_{A_2}(u_2), \ldots, \mu_{A_n}(u_n)\big]$$
Fuzzy Relation
An n-ary fuzzy relation is a fuzzy set in U1 × U2 × ... × Un and can be expressed as
$$R = \big\{\big((u_1, \ldots, u_n),\, \mu_R(u_1, \ldots, u_n)\big) \mid (u_1, \ldots, u_n) \in U_1 \times \cdots \times U_n\big\}$$
Example 3.1
Let A be a fuzzy set defined on a universe of three discrete temperatures, T={t1, t2, t3}, and B
be a fuzzy set defined on a universe of two discrete pressure P={p1, p2}. Fuzzy set A
represents the “ambient” temperature and fuzzy set B represents the “near optimum” pressure
for a certain heat exchanger, and the Cartesian product might represent the conditions
(temperature-pressure pairs) of the exchanger that are associated with “efficient” operations.
Let
T=0.1/t1 + 0.6/t2 + 1/t3
P=0.4/p1 + 0.8/p2
$$R = T \times P = \begin{bmatrix} 0.1 & 0.1 \\ 0.4 & 0.6 \\ 0.4 & 0.8 \end{bmatrix}$$
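A sketch that reproduces this relation with the min-based Cartesian product:

```python
import numpy as np

T = {"t1": 0.1, "t2": 0.6, "t3": 1.0}   # 'ambient' temperature
P = {"p1": 0.4, "p2": 0.8}              # 'near optimum' pressure

# Cartesian product: mu_R(t, p) = min(mu_T(t), mu_P(p))
R = np.array([[min(mu_t, mu_p) for mu_p in P.values()] for mu_t in T.values()])
print(R)
# [[0.1 0.1]
#  [0.4 0.6]
#  [0.4 0.8]]
```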
Example 3.2
In the armature control of a DC motor, suppose that the membership functions for the armature
resistance Ra (ohms), the armature current Ia (A), and the motor speed N (rpm) are given in their
per unit values:
$$R_a = \left\{\frac{0.3}{30} + \frac{0.7}{60} + \frac{1.0}{100} + \frac{0.2}{120}\right\}$$
$$I_a = \left\{\frac{0.2}{20} + \frac{0.4}{40} + \frac{0.6}{60} + \frac{0.8}{80} + \frac{1.0}{100} + \frac{0.1}{120}\right\}$$
$$N = \left\{\frac{0.33}{500} + \frac{0.67}{1000} + \frac{1.0}{1500} + \frac{0.15}{1800}\right\}$$
The fuzzy relation between armature resistance and armature current, P = Ra × Ia, and the fuzzy
relation between armature current and motor speed, Q = Ia × N, can then be calculated.
In control system design, a PID controller is effective for a fixed control environment. In
order to cope with a varying control environment or system non-linearity, an adaptive controller,
a self-tuning PID controller, an H∞ controller or a sliding mode controller may be used. The
design of these controllers needs a mathematical model of the process in order to formulate
the input-output relation. Such models can be very difficult or very time consuming to
identify.
In the fuzzy-logic-based approach, the inputs, outputs and control response are specified in
terms similar to those that might be used by an expert. Complex mathematical models of the
system under control are not required. Essentially, complicated knowledge based on the
experience of an expert can be incorporated in the fuzzy system in a relatively simple
way. Usually, this knowledge is expressed in the forms of rules.
Fuzzy logic and its applications in control engineering can be considered as the most important
area in fuzzy set theory and its applications. Since the invention of the first fuzzy controller by
Mamdani in 1974, fuzzy logic controllers (FLCs) have been successfully applied in numerous
industrial applications such as cement-kiln process control, automatic train operation,
camcorder autofocussing, crane control, etc.
A fuzzy logic controller (FLC) can be typically incorporated in a closed loop control system as
shown in Figure 3.1. The main elements of the FLC are a fuzzification unit, an inference engine
with a knowledge base, and a defuzzification unit.
When a FLC is designed to replace a conventional PD controller, the input variables of the
FLC are error (e) and change of error (ce). The output variable of the FLC is a control signal
u.
The fuzzy sets for each system variable are defined in typical linguistic terms such as those shown in Figure 3.2.
There are two ways to define the membership for a fuzzy set: numerical or functional. A
numerical definition expresses the degree of membership function of a fuzzy set as a vector
of numbers whose dimension depends on the level of discretisation in the universe of discourse.
A functional definition denotes the membership function of a fuzzy set in a functional form
such as the triangular function or a Gaussian function.
Figure 3.2 shows typical fuzzy sets and membership functions of the system variables error
e, change of error ce, and controller output (plant input) u in numerical form. A functional-form
membership for a fuzzy set is shown in Figure 3.3.
3.2.2 Fuzzification
Fuzzification is the process of mapping from observed inputs to fuzzy sets in the various input
universes of discourse. In process control, the observed data is usually crisp, and fuzzification
is required to map the observed range of crisp inputs to corresponding fuzzy values for the
system input variables. The mapped data are further converted into suitable linguistic terms
as labels of the fuzzy sets defined for system input variables. This process can be expressed
by x = fuzzifier(x0), where x0 is the observed crisp input and x is the resulting fuzzy representation.
Example 3.3
Assume that the range of “error” is [−5V, 5V] and the fuzzy set “error” has 9 members [NVB,
NB, NM, NS, ZE, PS, PM, PB, PVB] with triangular membership functions shown in Figure
3.4.
Show that if
e = 2.25V
then
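A sketch of the fuzzification step for this example. The nine triangle centres are assumed to be evenly spaced over [−5, 5] (Figure 3.4 may differ); under that assumption e = 2.25 V fires PS and PM:

```python
LABELS = ["NVB", "NB", "NM", "NS", "ZE", "PS", "PM", "PB", "PVB"]
CENTRES = [-5 + 1.25 * i for i in range(9)]   # assumed evenly spaced centres
WIDTH = 1.25                                   # half-base of each triangle

def fuzzify(x):
    """Map a crisp input to membership grades of the nine triangular sets."""
    grades = {}
    for label, c in zip(LABELS, CENTRES):
        mu = max(0.0, 1.0 - abs(x - c) / WIDTH)
        if mu > 0.0:
            grades[label] = round(mu, 2)
    return grades

print(fuzzify(2.25))   # e.g. {'PS': 0.2, 'PM': 0.8} under the assumed centres
```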
In a FLC, knowledge of the application domain and the control objectives is formulated
subjectively in most applications, based on an expert's experience. However, an
“objective” knowledge base may be constructed in a learning/self-organising environment by
using fuzzy modelling techniques.
The knowledge base consists of a data base and a rule base. The data base provides the
necessary definitions of the fuzzy parameters as fuzzy sets with membership functions defined
on the universe of discourse for each variable. The rule base consists of fuzzy control rules
intended to achieve the control objectives.
There are two main types of fuzzy inference rules in fuzzy logic reasoning: generalised modus
ponens (GMP) and generalised modus tollens (GMT). GMP is widely used in fuzzy logic
control applications and GMT is commonly used in expert systems, especially medical
diagnosis applications.
GMP:
Premise 1: x is A'
Premise 2: IF x is A THEN y is B
Consequence (Conclusion): y is B'

GMT:
Premise 1: y is B'
Premise 2: IF x is A THEN y is B
Consequence (Conclusion): x is A'
There are several sentence connectives such as AND, OR, and ALSO. The connectives AND
and OR are often used in the antecedent part, while connective ALSO is usually used in the
consequent part of fuzzy rules.
A fuzzy control algorithm should always be able to infer a proper control action for any input
in the universe of discourse. This property is referred to as 'completeness'. If the number of
fuzzy sets, or 'predicates', for each input variable is denoted by m and the number of system
input variables by n, then mⁿ different rules are required for completeness in the
conventional expert system approach. For example, if the number of fuzzy sets per system
input variable m is 7 and the number of input variables n is 3, then 7³ = 343 rules are required.
In contrast with a conventional expert system, a FLC rule base typically only uses a small
number of rules to attain completeness in its behaviour. It has been found that the number of
control rules in a FLC can be remarkably reduced primarily due to the overlap of the fuzzy
sets and the soft matching approach used in fuzzy inference.
Example 3.4
Fuzzy logic is used to control a two-axis mirror gimbal for aligning a laser beam using a
quadrant detector. Electronics sense the error in the position of the beam relative to the centre
of the detector and produces two signals representing the x and y direction errors. The
controller processes the error information using fuzzy logic and provides appropriate control
voltages to run the motors which reposition the beam.
To represent the error input to the controller, a set of linguistic variables is chosen to represent
5 degrees of error, 3 degrees of change of error, and 5 degrees of armature voltage.
Membership functions are constructed to represent the input and output values' grades of
membership as shown in Figure 3.5.
Two sets of rules are chosen. These "Fuzzy Associative Memories" or FAMs, are a shorthand
matrix notation for presenting the rule set. A linguistic armature voltage rule is fired for each
pair of linguistic error variables and linguistic change in error variables.
A set of "pruned" rules is also used to investigate the effect of reducing the processed
information on the behaviour of the controller. When pruned, the FAM is slightly modified
to incorporate all the rules. The effect on the system response by modifying the FAM bank is
more dramatic than modifying the membership functions. Changing the FAM "coarsely" tunes
the response while adjusting the membership functions "finely" tunes the response. The FAMs
are shown in Table 3.1. Table 3.1(a) shows the full set of 15 fuzzy rules and Table 3.1(b)
shows the rule base fuzzy set after pruning.
3.2.4 Reasoning Techniques
There are various ways in which the observed input values can be used to identify which
rules should be used and to infer an appropriate fuzzy control action. Among the various
fuzzy inference methods, the following are the most commonly used in industrial FLCs.
Due to the nature of industrial process control, it is often the case that the input data are crisp.
Fuzzification typically involves treating these as fuzzy singletons, which are then used with a
fuzzy inference method. Assume that the fuzzy control rule base has only two rules:
Let the fire strength of the i-th rule be denoted by αi. For the inputs x0 and y0, the fire strengths
α1 and α2 can be calculated from
$$\alpha_1 = \mu_{A_1}(x_0) \wedge \mu_{B_1}(y_0), \qquad \alpha_2 = \mu_{A_2}(x_0) \wedge \mu_{B_2}(y_0) \tag{3.7}$$
In MAX-MIN fuzzy reasoning, Mamdani's minimum operation rule Rc is used for fuzzy
implication. The membership of the inferred consequence C is point-wise given by
$$\mu_C(w) = \big[\alpha_1 \wedge \mu_{C_1}(w)\big] \vee \big[\alpha_2 \wedge \mu_{C_2}(w)\big]$$
Figure 3.6 shows the MAX-MIN inference process for the crisp input values x0 and y0, which
have been regarded as fuzzy singletons.
Figure 3.7 shows the MAX- PRODUCT inference process for the crisp input values x0 and
y0.
3.2.5 Defuzzification
Defuzzification is the process of mapping from a space of inferred fuzzy control actions to a
space of non-fuzzy (crisp) control actions. A defuzzification strategy is aimed at producing
a non-fuzzy control action that best represents the possibility distribution of the inferred
fuzzy control action. This can be expressed by:
u = defuzzifier(U) (3.10)
The MOM strategy (height defuzzification) generates a control action which represents the
mean value of all local control actions whose membership functions reach the maximum.
Let the number of rules be denoted by n, the maximum height of the membership function of
the fuzzy set defined for the output control (consequent) of the i-th rule by the crisp value Hi,
the corresponding crisp control value along the output universe of discourse by Ui, and the fire
strength of the i-th rule by αi. Then the crisp control value u* defuzzified using the MOM
method is given by:
$$u^* = \frac{\sum_{i=1}^{n} \alpha_i H_i U_i}{\sum_{i=1}^{n} \alpha_i H_i} \tag{3.11}$$
The crisp value Ui is a support value at which the membership function reaches its
maximum Hi (most often Hi = 1). In addition, although the fire strength αi of the i-th rule
is normally calculated as described in Equation 3.7, a more effective method for calculating
the fire strength in a MOM method is sometimes used. Assuming Hi = 1,
$$u^* = \frac{\sum_{i=1}^{n} \alpha_i U_i}{\sum_{i=1}^{n} \alpha_i} \tag{3.13}$$
The COA strategy generates the centre of gravity of the possibility distribution of a control
action.
Let the number of rules be denoted by n, the amount of control output of the i-th rule by ui, and
its corresponding membership value in the output fuzzy set C by μC(ui). Then the crisp control
value u* defuzzified using the COA method is given by:
$$u^* = \frac{\sum_{i=1}^{n} \mu_C(u_i)\,u_i}{\sum_{i=1}^{n} \mu_C(u_i)} \tag{3.14}$$
If the universe of discourse is continuous, then the COA strategy generates an output control
action of
$$u^* = \frac{\int \mu_C(u)\,u\,du}{\int \mu_C(u)\,du} \tag{3.15}$$
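A sketch of both defuzzification strategies on a discretised output universe (Equations (3.13) and (3.14)); the fire strengths and output samples below are illustrative values only:

```python
def mom_defuzzify(alphas, U):
    """Height/MOM defuzzification, Equation (3.13), assuming Hi = 1."""
    return sum(a * u for a, u in zip(alphas, U)) / sum(alphas)

def coa_defuzzify(mu_C, U):
    """Discrete centre-of-area defuzzification, Equation (3.14)."""
    return sum(m * u for m, u in zip(mu_C, U)) / sum(mu_C)

# Illustrative values (not from the notes): two fired rules with
# peak output positions U and fire strengths alphas.
alphas = [0.25, 0.75]
U = [0.0, 4.0]
print(mom_defuzzify(alphas, U))            # weighted mean of the peaks

# COA over a sampled output membership curve.
u_samples = [-2, -1, 0, 1, 2, 3, 4, 5]
mu_samples = [0.0, 0.1, 0.25, 0.25, 0.5, 0.75, 0.75, 0.3]
print(coa_defuzzify(mu_samples, u_samples))
```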
Example 3.5
In a control system, the membership functions of the system variables error e, change of error
ce, and controller output (plant input) u are shown in Figure 3.8.
Assume that the fuzzy rules are represented by a FAM Table (Table 3.2), and the current
values of error e and change of error ce are {e = 1.5, ce = −0.05}. Assume also that the ranges
of error, change of error, and controller output are [−3,3], [−1,1], and [−6,6] respectively.
Solution 3.5
Step 1: Fuzzification
The current values of error e and change of error ce are {e = 1.5, ce = −0.05}
In fuzzy notation:
E(1.5) = 0.25/ZE + 0.75/P
CE(−0.05) = 0.1/N + 0.9/ZE
Step 2: Reasoning
E CE U (MOM) U(COA)
Step 3: Defuzzification
According to the COA method, the output u* can be found using the max-min inference
process as shown below:
$$u^* = \frac{\int \mu_C(u)\,u\,du}{\int \mu_C(u)\,du}$$
$$= \frac{\int_{-6}^{-3.6} 0.1u\,du + \int_{-3.6}^{-3}(0.25u+1)u\,du + \int_{-3}^{1} 0.25u\,du + \int_{1}^{3}(0.25u)u\,du + \int_{3}^{6} 0.75u\,du}{\int_{-6}^{-3.6} 0.1\,du + \int_{-3.6}^{-3}(0.25u+1)\,du + \int_{-3}^{1} 0.25\,du + \int_{1}^{3}(0.25u)\,du + \int_{3}^{6} 0.75\,du}$$
$$= \frac{\left(0.1\tfrac{u^2}{2}\right)\Big|_{-6}^{-3.6} + \left(0.25\tfrac{u^3}{3}+\tfrac{u^2}{2}\right)\Big|_{-3.6}^{-3} + \left(0.25\tfrac{u^2}{2}\right)\Big|_{-3}^{1} + \left(0.25\tfrac{u^3}{3}\right)\Big|_{1}^{3} + \left(0.75\tfrac{u^2}{2}\right)\Big|_{3}^{6}}{\left(0.1u\right)\Big|_{-6}^{-3.6} + \left(0.25\tfrac{u^2}{2}+u\right)\Big|_{-3.6}^{-3} + \left(0.25u\right)\Big|_{-3}^{1} + \left(0.25\tfrac{u^2}{2}\right)\Big|_{1}^{3} + \left(0.75u\right)\Big|_{3}^{6}} \approx 2.13$$
To control any physical variable, we must first measure it. The system for measurement of the
controlled signal is called a sensor. The physical system under control is called a plant. In a
closed-loop control system, certain forcing signals of the system (the inputs) are determined
by the responses of the system (the outputs). To obtain satisfactory responses and
characteristics for the closed-loop control system, it is necessary to connect an additional
system, known as a compensator, or a controller, to the loop. The general form of a closed-
loop control system is illustrated in Figure 3.10. The control problem is stated as follows. The
output, or response, of the physical system under control (i.e., the plant) is adjusted as required
by the error signal. The error signal is the difference between the actual response of the plant,
as measured by the sensor system, and the desired response, as specified by a reference input.
The knowledge-base module in Figure 3.11 contains knowledge about all the input and output
fuzzy partitions. It will include the term set and the corresponding membership functions
defining the input variables to the fuzzy rule-base system and the output variables, or control
actions, to the plant under control.
The steps in designing a simple fuzzy control system are as follows:
1. Identify the variables (inputs, states, and outputs) of the plant.
2. Partition the universe of discourse or the interval spanned by each variable into a
number of fuzzy subsets, assigning each a linguistic label (subsets include all the
elements in the universe).
3. Assign or determine a membership function for each fuzzy subset.
4. Assign the fuzzy relationships between the inputs’ or states’ fuzzy subsets on the one
hand and the outputs’ fuzzy subsets on the other hand, thus forming the rule-base.
5. Choose appropriate scaling factors for the input and output variables to normalize the
variables to the [0, 1] or the [−1, 1] interval.
6. Fuzzify the inputs to the controller.
7. Use fuzzy approximate reasoning to infer the output contributed from each rule.
8. Aggregate the fuzzy outputs recommended by each rule.
9. Apply defuzzification to form a crisp output.
The following example shows the flexibility and reasonable accuracy of a typical application
in fuzzy control.
Example 3.6
We will conduct a simulation of the final descent and landing approach of an aircraft. The
desired profile is shown in Figure 3.12. The desired downward velocity is proportional to the
square of the height. Thus, at higher altitudes, a large downward velocity is desired. As the
height (altitude) diminishes, the desired downward velocity gets smaller and smaller. In the
limit, as the height becomes vanishingly small, the downward velocity also goes to zero. In
this way, the aircraft will descend from altitude promptly but will touch down very
gently to avoid damage.
The two state variables for this simulation will be the height above ground, h, and the vertical
velocity of the aircraft, v (Figure 3.13). The control output will be a force that, when applied
to the aircraft, will alter its height, h, and velocity, v. The differential control equations are
loosely derived as follows. See Figure 3.14. Mass, m, moving with velocity, v, has momentum,
p = mv. If no external forces are applied, the mass will continue in the same direction at the
same velocity. If a force, f, is applied over a time interval Δt, a change in velocity of Δv = fΔt/m
will result. If we let Δt = 1.0 (s) and m = 1.0 (lb s² ft⁻¹), we obtain Δv = f (lb), or the change in
velocity is proportional to the applied force.
Step 2. Define a membership function for the control output, as shown in Table 3.5 and
Figure 3.17.
Step 3. Define the rules and summarize them in an FAM table (Table 3.6). The values in the
FAM table, of course, are the control outputs.
Step 4: Define the initial conditions and conduct a simulation for four cycles. Because the
task at hand is to control the aircraft’s vertical descent during approach and landing, we will
start with the aircraft at an altitude of 1000 feet, with a downward velocity of −20 ft s−1. We
will use the following equations to update the state variables for each cycle:
vi+1 = vi + fi,
hi+1 = hi + vi.
Initial height, h0 : 1000 ft
Initial velocity, v0 : −20 fts−1
Control f0 to be computed
Height h fires L at 1.0 and M at 0.6 (h1000 = 0.6/M+1.0/L)
Velocity v fires only DL at 1.0 (v-20 = 1.0/DL)
We defuzzify using COA and get f0 = 5.8 lb. This is the output force computed from the initial
conditions. The results for cycle 1 appear in Figure 3.18.
Figure 3.18 Truncated consequents and union of fuzzy consequent for cycle 1
Now, we compute new values of the state variables and the output for the next cycle:
h1 = h0 + v0 = 1000 +(−20) = 980ft,
v1 = v0 + f0 = −20 + 5.8= −14.2fts−1.
We defuzzify using COA and get f1= − 0.5lb. Results are shown in Figure 3.19.
f1= −0.5
Now, we compute new values of the state variables and the output for the next cycle.
We defuzzify using COA and get f2= − 0.4lb. Results are shown in Figure 3.20.
f2= −0.4
We defuzzify using COA and get f3= 0.3lb. Results are shown in Figure 3.21.
f3= 0.3lb
The summary of the four-cycle simulation results is presented in Table 3.7. If we look at the
downward velocity versus altitude (height) in Table 3.7, we get a descent profile that appears
to be a reasonable start at the desired parabolic curve shown in Figure 3.12 at the beginning of
the example.
Table 3.7 Summary of four-cycle simulation results
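A sketch of the simulation loop behind Table 3.7. The fuzzy controller itself (membership functions and FAM of Tables 3.4-3.6) is not reproduced here; the forces f0-f3 are the COA results quoted in the text:

```python
# State-update loop for the landing example:
#   v_{i+1} = v_i + f_i,   h_{i+1} = h_i + v_i
h, v = 1000.0, -20.0               # initial height (ft) and velocity (ft/s)
forces = [5.8, -0.5, -0.4, 0.3]    # f0..f3 as computed by COA in the text

for i, f in enumerate(forces):
    print(f"cycle {i}: h = {h:7.1f} ft, v = {v:6.1f} ft/s, f = {f:5.1f} lb")
    h = h + v                       # height update
    v = v + f                       # velocity update
```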
The Self-Organising Fuzzy Logic Controller (SOFLC) has a control policy which can
change with respect to the process it is controlling and the environment it is operating in.
The particular feature of this controller is that it strives to improve its performance until it
converges to a predetermined quality.
The SOFLC has to perform two tasks simultaneously: to observe the environment while
issuing the appropriate control actions and to use the results of these control actions to
improve them further. In other words, the function of the SOFLC is one of combined system
identification and control.
The ability of the SOFLC to carry out system identification makes it unnecessary to have a
detailed understanding of the environment. The advantage of this technique lies in the fact
that only a minimal amount of information about the environment is required. It is also
useful when the system under control is subject to time-varying parameter changes and
unknown disturbances.
The learning module, which contains a performance index table (learning rule base table)
and a rule generation and modification algorithm, is responsible for creating new rules or
modifying existing ones.
In a SOFLC, the control state of a process is monitored by the learning module. When an
undesirable output of the process is detected, the fuzzy control rules are created or
modified based on the corrections given by the performance index table.
The performance index table relates the state of the process to the deviation from its
desired behaviour, and defines the corrections required for the FLC to bring the system to the
desired states. Depending on the structure of a FLC, a performance index table can be defined
linguistically or expressed quantitatively using performance indices such as the error,
mean square error, maximum absolute error, and averaged error of the system variables.
Typically, the performance index table is derived from a general set of linguistic rules
which express the desired control trajectories in the state-space of the system. Table 3.8
gives an example of a performance index table.
To bring the system back to the desired state, the control output of the FLC will have to be
changed into $(e_{t-d},\ ce_{t-d},\ u_{t-d} + P(e_t, ce_t))$.
Based on this group of system states, a new fuzzy control rule can be formulated. The
algorithm for rule base generation/modification will check if a fuzzy control rule exists
under this system state. If not, a new rule will be added to the rule base. Otherwise, the
existing rule will be modified into a newly formulated rule. If P(et ,cet ) = 0, the system
performance is satisfactory. No rule generation or modification should take place in the
current system state.
Let the j-th modification rule in the performance modification module be denoted by:
For a SOFLC with N basic control rules and M performance modification rules, the rule
modifications are often carried out using the following procedure:
1. Calculate the fire strength αi for each control rule.
2. Perform the fuzzy reasoning to obtain the output ui for each control rule.
3. Calculate the performance modification P.
4. Find the dominant rule which contributes most to the control action.
5. If no rule is found, create a new rule in the control rule base.
6. If the k-th rule is found, modify rule k using a new control output.
3.4.5 Remarks
In a SOFLC, the fuzzy control rules depend strongly on the performance index table. The
performance index table is often designed heuristically based on intuitive understanding of
the process. However, it is not trivial to design a performance index table which can
represent a desired output response exactly. That may be one of the reasons that several
other approaches have been proposed and implemented including the following:
References
Jamshidi, M., Vadiee, N., Ross, T.J. 1993, Fuzzy Logic and Control, Prentice Hall, New
Jersey.
Kosko B. 1992. Neural Networks and Fuzzy Systems. New Jersey: Prentice Hall.
Ross T.J. 2017. Fuzzy Logic with Engineering Applications (4th edition). McGraw-Hill.
Wang, L. X. 1994, Adaptive Fuzzy Systems and Control, Prentice Hall, New Jersey.
Yager, R.R., Filev, D.P. 1994, Essentials of Fuzzy Modelling and Control, John Wiley,
New York.
Yan, J., Ryan, M., Power, J. 1994, Using Fuzzy Logic, Prentice Hall, Hertfordshire.
CHAPTER FOUR
SINGLE-LAYER FEEDFORWARD NEURAL NETWORKS
AND RECURRENT NEURAL NETWORK
_______________________________________________________
One of the most useful tasks which can be performed by networks of interconnected nonlinear
elements is pattern classification. A pattern is the quantitative description of an object, event,
or phenomenon. The classification may involve spatial and temporal patterns. Examples of
spatial patterns are pictures, weather maps, fingerprints, and characters. Examples of
temporal patterns include speech signals, electrocardiograms, and seismograms. Temporal
patterns usually involve ordered sequences of data appearing in time.
The goal of pattern classification is to assign a physical object, event, or phenomenon to one
of the pre-specified classes (or categories). Typical classification tasks required from a
human being have been classification of the environment into groups such as living species,
plants, weather conditions, minerals, tools, human faces, voices, etc. The interpretation of
data has been learned gradually as a result of repetitive inspecting and classifying of examples.
A classifying system consists of an input transducer providing the input pattern data to the
feature extractor as shown in Figure 4.1. Typically, inputs to the feature extractor are sets of
data vectors which belong to a certain category. Usually, the converted data at the output of
the transducer can be compressed without loss of essential information. The compressed data
are called features.
Two simple ways to generate the pattern vector for cases of spatial and temporal objects are
shown in Figure 4.2. In Figure 4.2(a), each component xi of the vector 𝐱 =
[𝑥1 𝑥2 ⋯ 𝑥𝑛 ]′ is assigned the value 1 if the i-th cell contains a portion of a spatial object,
otherwise the value 0 is assigned. In the case of a temporal object being a continuous function
of time t, as in Figure 4.2(b), the pattern vector may be formed by letting xi = f(ti), i = 1,
2, …, n.
Assume that a set of n-dimensional patterns x1,x2 , ··· ,xP and the desired classification for
each pattern (R categories) are known. In the classification step, the membership in a category
needs to be determined by a classifier based on the comparison of R discriminant functions
g1(x), g2(x), ..., gR(x).
Within the region Hi, the i-th discriminant function gi(x) will have the largest value.
Example 4.1
Six patterns in 2-dim pattern space shown in Figure 4.3 need to be classified according to
their membership in sets as follows
[0,0],[-0.5,-1],[-1,-2] : Class 1
[2,0],[1.5,-1],[1,-2] : Class 2
Inspection of the patterns indicates that the equation for the decision surface can be arbitrarily
chosen
g(x) = −2x1+x2+2 (4.2)
It is obvious that g(x) > 0 and g(x) < 0 in each of the half-planes containing patterns of Class
1 and Class 2 respectively, and g(x) = 0 for all points on the line.
Note also that the decision surface equation g(x) can be derived from
g(x)=g1(x) − g2(x) (4.5)
A basic pattern classifier is shown in Figure 4.4. For a given pattern, the i-th discriminator
computes the value of the function gi(x). The maximum selector implements condition (4.1)
and selects the largest of all inputs, thus yielding the response equal to the category number
io.
A special case of classifiers is the dichotomiser, where there are only two categories (R = 2). A
single threshold logic unit (TLU) can be used to build such a simple dichotomiser as shown
in Figure 4.5.
In the linear classification case shown in Figure 4.7, there are two clusters of patterns, with
each cluster belonging to one known category. The central points (prototype points) P1 and P2
of Class 1 and Class 2 clusters are vectors x1 and x2 respectively. These points can be
interpreted as centres of gravity for each cluster. The decision hyperplane should contain the
midpoint of the line segment connecting the two central points, and should be normal to the
vector (x1−x2), which is directed toward P1.
$$g(\mathbf{x}) = (\mathbf{x}_1 - \mathbf{x}_2)^T\mathbf{x} + \frac{1}{2}\left(\|\mathbf{x}_2\|^2 - \|\mathbf{x}_1\|^2\right) = 0 \tag{4.6}$$
$$g(\mathbf{x}) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + w_{n+1} = 0 \tag{4.7}$$
The weighting coefficients of the dichotomiser can be obtained easily from Eqs. (4.6)-(4.7)
as follows:
$$\mathbf{w} = \mathbf{x}_1 - \mathbf{x}_2, \qquad w_{n+1} = \frac{1}{2}\left(\|\mathbf{x}_2\|^2 - \|\mathbf{x}_1\|^2\right) \tag{4.8}$$
Assume that a minimum distance classification is required to classify patterns into one of the
R categories. Each of the R classes is represented by the central points P1, P2, ..., PR, which
correspond to the vectors x1, x2, ..., xR respectively. The Euclidean distance between the input
pattern x and the prototype pattern vector xi is
$$\|\mathbf{x} - \mathbf{x}_i\| = \sqrt{(\mathbf{x} - \mathbf{x}_i)^T(\mathbf{x} - \mathbf{x}_i)} \tag{4.9}$$
$$\|\mathbf{x} - \mathbf{x}_i\|^2 = \mathbf{x}^T\mathbf{x} - 2\mathbf{x}_i^T\mathbf{x} + \mathbf{x}_i^T\mathbf{x}_i \tag{4.10}$$
Note that choosing the largest of the terms $\mathbf{x}_i^T\mathbf{x} - 0.5\,\mathbf{x}_i^T\mathbf{x}_i$ is equivalent to choosing the
smallest of the distances $\|\mathbf{x} - \mathbf{x}_i\|$, as the term $\mathbf{x}^T\mathbf{x}$ is independent of i and shows up in each
of the R distances. Therefore, for a minimum-distance classifier, the discriminant function can be chosen as
$$g_i(\mathbf{x}) = \mathbf{x}_i^T\mathbf{x} - \frac{1}{2}\mathbf{x}_i^T\mathbf{x}_i$$
The decision surface Sij for the contiguous decision regions Hi, Hj is a hyperplane given by
the equation $g_i(\mathbf{x}) - g_j(\mathbf{x}) = 0$.
Example 4.2
Assume that the prototype points are as shown in Figure 4.9 and their coordinates are
$$P_1 = \begin{bmatrix} 10 \\ 2 \end{bmatrix}, \quad P_2 = \begin{bmatrix} 2 \\ -5 \end{bmatrix}, \quad P_3 = \begin{bmatrix} -5 \\ 5 \end{bmatrix}$$
Design a linear (minimum-distance) classifier.
$$\mathbf{w}_1 = \begin{bmatrix} 10 \\ 2 \\ -52 \end{bmatrix}, \quad \mathbf{w}_2 = \begin{bmatrix} 2 \\ -5 \\ -14.5 \end{bmatrix}, \quad \mathbf{w}_3 = \begin{bmatrix} -5 \\ 5 \\ -25 \end{bmatrix}$$
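A sketch that computes these augmented weight vectors from the prototype points (wi = [xi; −0.5‖xi‖²]) and uses them to classify a test point; the test point is an assumed illustration:

```python
import numpy as np

prototypes = [np.array([10.0, 2.0]),    # P1
              np.array([2.0, -5.0]),    # P2
              np.array([-5.0, 5.0])]    # P3

# Augmented weight vectors: w_i = [x_i ; -0.5 * ||x_i||^2]
W = [np.append(p, -0.5 * p @ p) for p in prototypes]
print([w.tolist() for w in W])   # [[10, 2, -52], [2, -5, -14.5], [-5, 5, -25]]

def classify(x):
    """Return the class whose discriminant g_i(x) = w_i' [x; 1] is largest."""
    y = np.append(x, 1.0)                      # augmented input pattern
    scores = [w @ y for w in W]
    return int(np.argmax(scores)) + 1          # classes numbered 1..R

print(classify(np.array([8.0, 1.0])))   # closest to P1 -> class 1
```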
It is possible for neural network classifiers to derive their weights during the learning cycle.
The sample pattern vectors x1,x2 , ···, xp, called the training sequence, are presented to a
classifier along with the correct response. The classifier modifies its parameters by means of
iterative, supervised learning. The network learns from experience by comparing the targeted
correct response with the actual response.
A supervised training procedure for the dichotomiser as shown in Figure 4.10 can be
developed as follows. The dichotomiser consists of (n+1) weights and the TLU (threshold
logic unit). It is identical to the binary bipolar perceptron.
The normal vector will always point toward the side of the space for which 𝐰 T 𝐲 > 𝟎. This
side is called the positive side, or positive semispace, of the hyperplane.
Figure 4.11 shows a decision surface for the training pattern y1 in the augmented weight space
of the discrete perceptron (Figure 4.10).
𝐰 2 = 𝐰1 + 𝑐𝐲1 (4.18)
Case B illustrates a similar misclassification with the initial weight and pattern y1 of Class 2
being input. Obviously, the pattern is misclassified due to g1(y1) = (w1)ᵀy1 > 0. To decrease the
discriminant function g1(y1), the weight vector should be adjusted in the direction of steepest
decrease.
𝐰 2 = 𝐰1 − 𝑐𝐲1 (4.19)
The supervised training procedure can be summarised using the following expression for the
augmented weight vector:
𝐰 ∗ = 𝐰 ± 𝑐𝐲 (4.20)
where the positive sign applies for undetected pattern of Class 1, and the negative sign for
undetected pattern of Class 2. If a correct classification takes place, no adjustment of weights
is made.
$$p = \frac{|\mathbf{w}^T\mathbf{y}|}{\|\mathbf{y}\|} \tag{4.22}$$
Note that the correction constant c can be selected such that the corrected weight vector w* is
placed on the decision hyperplane wᵀy = 0. This implies that
$$c = \frac{\mathbf{w}^T\mathbf{y}}{\mathbf{y}^T\mathbf{y}} \tag{4.24}$$
or more conveniently
$$c = \frac{|\mathbf{w}^T\mathbf{y}|}{\mathbf{y}^T\mathbf{y}} \tag{4.25}$$
$$\|c\mathbf{y}\| = \frac{|\mathbf{w}^T\mathbf{y}|}{\mathbf{y}^T\mathbf{y}}\,\|\mathbf{y}\| \tag{4.26}$$
Note that the distance p from the point w to the decision plane is identical to the length of
the weight incremental vector. Using this technique, the correction increment c is not constant
and depends on the current training pattern.
where the coefficient λ is the ratio of the distance between the old weight vector w and the
new weight vector w*, to the distance from w to the hyperplane in the weight space. Note
that 0 < λ < 1 for the fractional correction rule and 1 < λ < 2 for the absolute correction rule.
Example 4.3
$$\mathbf{y}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad \mathbf{y}_2 = \begin{bmatrix} -0.5 \\ 1 \end{bmatrix}, \quad \mathbf{y}_3 = \begin{bmatrix} 3 \\ 1 \end{bmatrix}, \quad \mathbf{y}_4 = \begin{bmatrix} -2 \\ 1 \end{bmatrix}$$
Using the fixed correction rule in Eq. (4.27), an arbitrary selection of c = 1, and the initial
weight vector chosen arbitrarily as
$$\mathbf{w}^1 = \begin{bmatrix} -2.5 \\ 1.75 \end{bmatrix}$$
show that
$$\mathbf{w}^2 = \begin{bmatrix} -1.5 \\ 2.75 \end{bmatrix}, \quad \mathbf{w}^3 = \begin{bmatrix} -1 \\ 1.75 \end{bmatrix}, \quad \mathbf{w}^4 = \begin{bmatrix} 2.0 \\ 2.75 \end{bmatrix}$$
Since we have no evidence of correct classification for weights 𝐰 5 , the training set consisting
of an ordered sequence of patterns 𝐲1 , 𝐲2 , 𝐲3 , 𝐲4 needs to be recycled. Therefore, we have
𝐲5 = 𝐲1 , 𝐲6 = 𝐲2 ,𝐲7 = 𝐲3 ,𝐲8 = 𝐲4 , etc.
Show that the final weight vector is
$$\mathbf{w}^{11} = \begin{bmatrix} 3 \\ 0.75 \end{bmatrix}$$
and that this weight vector provides correct classification of the entire training set.
Figure 4.13 Discrete perceptron classifier training (fixed correction rule training)
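As a concrete illustration, the following minimal Python sketch applies the fixed correction rule with c = 1 to the patterns of Example 4.3. The class labels d = [+1, −1, +1, −1] are not stated explicitly above and are inferred here from the quoted weight sequence, so they should be treated as an assumption of this sketch.

import numpy as np

# Augmented training patterns y_k and (assumed) class labels: d = +1 for Class 1, d = -1 for Class 2.
Y = np.array([[1.0, 1.0], [-0.5, 1.0], [3.0, 1.0], [-2.0, 1.0]])
d = np.array([1, -1, 1, -1])          # assumed labels, consistent with eq. (4.20)
w = np.array([-2.5, 1.75])            # initial weight w^1
c = 1.0                               # fixed correction increment

step = 0
while True:
    corrections = 0
    for y, target in zip(Y, d):
        step += 1
        out = 1 if w @ y > 0 else -1  # TLU (discrete bipolar perceptron)
        if out != target:             # misclassified: w* = w +/- c*y, eq. (4.20)
            w = w + c * target * y
            corrections += 1
        print("step", step, "w =", w)
    if corrections == 0:              # a full cycle without corrections: training done
        break
# The loop terminates with w = [3, 0.75], matching w^11 quoted in Example 4.3.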
In this section, the continuous perceptron, as shown in Figure 4.14, is considered. The continuous perceptron is introduced to gain finer control over the training procedure and to work with a differentiable characteristic of the threshold element, thus enabling computation of the error gradient.
The training is based on the steepest descent technique. Starting from an arbitrarily chosen weight vector w, the gradient ∇E(w) of the current error is calculated. The adjusted weight vector w* is obtained by moving in the direction of the negative gradient along the multi-dimensional error surface. An example of an error surface is shown in Figure 4.15.
w* = w − η∇E(w)   (4.29)
Ek = (1/2)(dk − zk)²   (4.30)
with
vk = (w^k)^T yk,  zk = f(vk)
The error gradient can be calculated using the chain rule as follows:
∇Ek(w^k) = ∂Ek/∂w^k = (∂Ek/∂zk)(∂zk/∂vk)(∂vk/∂w^k)   (4.31)
∇Ek(w^k) = −(dk − zk)(∂zk/∂vk) yk   (4.32)
w* = w + η(d − z)(∂z/∂v) y   (4.33)
It can be seen that this rule is equivalent to the delta training rule (2.21). The calculation of the adjusted weights requires an arbitrary choice of η and the specification of the activation function z = f(v) used.
A significant difference between the discrete and continuous perceptron training is that the
discrete perceptron training algorithm always leads to a solution for linearly separable
problems. In contrast, the negative gradient-based continuous perceptron training does not
guarantee solutions for linearly separable patterns.
Example 4.4
A continuous perceptron with the bipolar activation function f2(v) = (1 − e^(−v))/(1 + e^(−v)), as shown in Figure 4.16, has to be trained to recognise the following classification of four patterns with known class membership d.
y1 = [1, 1]^T, y2 = [−0.5, 1]^T, y3 = [3, 1]^T, y4 = [−2, 1]^T
Note that for the bipolar activation function, ∂z/∂v = ∂f2(v)/∂v = 0.5(1 − z²). Using the delta training rule in equation (4.33), an arbitrary selection of η = 0.5, and the initial weights chosen arbitrarily as
w^1 = [−2.5, 1.75]^T
show that
w^2 = [−2.204, 2.046]^T, . . . , w^4 = [−2.1034, 1.9912]^T
Since we have no evidence of correct classification for weights 𝐰 4 , the training set consisting
of an ordered sequence of patterns 𝐲1 , 𝐲2 , 𝐲3 , 𝐲4 needs to be recycled. Therefore, we have
𝐲5 = 𝐲1 , 𝐲6 = 𝐲2 ,𝐲7 = 𝐲3 ,𝐲8 = 𝐲4 , etc.
Show that the weight vector after 40 cycles (160 steps) is w^16 = [3.1514, −0.6233]^T
and that this weight vector provides correct classification of the entire training set.
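The same training can be sketched in Python with the delta rule (4.33) and the bipolar activation f2. As in the previous sketch, the class memberships d = [+1, −1, +1, −1] are an assumption; with them, the first update reproduces w^2 = [−2.204, 2.046]^T quoted above.

import numpy as np

def f2(v):
    # bipolar activation: f2(v) = (1 - e^-v) / (1 + e^-v)
    return (1 - np.exp(-v)) / (1 + np.exp(-v))

Y = np.array([[1.0, 1.0], [-0.5, 1.0], [3.0, 1.0], [-2.0, 1.0]])
d = np.array([1.0, -1.0, 1.0, -1.0])   # assumed class memberships
w = np.array([-2.5, 1.75])             # initial weights w^1
eta = 0.5

for cycle in range(40):                # 40 cycles = 160 steps
    for y, target in zip(Y, d):
        z = f2(w @ y)
        # delta rule (4.33): w* = w + eta*(d - z)*(dz/dv)*y, with dz/dv = 0.5*(1 - z^2)
        w = w + eta * (target - z) * 0.5 * (1 - z**2) * y

print(w)    # a separating weight vector; compare with the values quoted in the example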
To apply the error-correcting algorithm to the task of multi-category classification, the linear
classifier in Figure 4.8 can be modified to include discrete perceptrons as shown in Figure
4.18. The assumption needed is that classes are linearly separable. Direct supervised training
of this network can be performed in a similar manner as in Section 4.1.6.
For example, using the fixed correction rule, the weight adjustment for this network is
where di and zi are the desired and actual responses of the i-th discrete perceptron
respectively.
Note that we have been using y_{n+1} = 1. It is important to note that, from the training viewpoint, any constant value of y_{n+1} is also appropriate. However, when y_{n+1} = −1, the value w_{n+1} becomes equal to the actual firing threshold of the neuron with input being the original pattern x:
y = [x, −1]^T with w_{n+1} = T
For R-category classifiers with local representation, the desired response or the training
pattern of the i-th category is
For R-category classifiers with distributed representation, the desired response or the training
pattern of the i-th category is not required, as more than a single neuron is allowed to
respond +1 in this mode.
Example 4.5
Similar to Example 4.2, assume that the coordinates of the prototype points are x1 = [10, 2]^T, x2 = [2, −5]^T, x3 = [−5, 5]^T, and design a linear classifier using 3 discrete perceptrons.
Also, the training set consisting of an ordered sequence of patterns y1, y2, y3 can be recycled if necessary. Therefore, we have y4 = y1, y5 = y2, y6 = y3, etc.
Using the fixed correction rule, an arbitrary selection of c = 1, and the initial weights chosen arbitrarily for each discrete perceptron as
w1^1 = [1, −2, 0]^T, w2^1 = [0, −1, 2]^T, w3^1 = [1, 3, −1]^T
show that the final weight vectors are:
w1 = [5, 3, 5]^T, w2 = [0, −1, 2]^T, w3 = [−9, 1, 0]^T
The three perceptron network obtained as a result of the training is shown in Figure 4.19. It
performs the following classification:
z1=sgn(5x1+3x2−5)
z2=sgn(− x2−2)
z3=sgn(− 9x1+x2)
The resulting surfaces are shown in Figure 4.20. Note that in contrast to the minimum-
distance classifier, this method has produced several indecision regions where no class
membership of an input pattern can be uniquely determined.
Hopfield's seminal papers in 1982 and 1984 (Hopfield 1982, 1984) were responsible for many
important applications in neural networks, especially in associative memory and optimisation
problems. His proposed Hopfield networks promoted construction of the first analogue VLSI
neural chip (Howard et al. 1988).
The single-layer feedback network (Hopfield network) is shown in Figure 4.21. It consists of
n neurons having threshold values T. The updated output z* of the network can be found from
v=Wz+x−T (4.35)
𝒛∗ = 𝑓(𝒗) (4.36)
where z, x, T are the output vector, external input vector, and threshold vector respectively,
f(·) is the activation function, and W is the weight matrix (connectivity matrix).
z = [z1, z2, …, zn]^T,  v = [v1, v2, …, vn]^T,  x = [x1, x2, …, xn]^T,  T = [T1, T2, …, Tn]^T   (4.37)
z = f(v) = [f(v1), f(v2), …, f(vn)]^T   (4.38)
W = [0  w12  ⋯  w1n;  w21  0  ⋯  w2n;  ⋮  ⋮  ⋱  ⋮;  wn1  wn2  ⋯  0] = [w1  w2  ⋯  wn]   (4.39)
The weight matrix W is symmetrical with zero diagonal entries (i.e. wij = wji, wii = 0).
Assuming that discrete perceptrons are used, for a discrete-time recurrent network, the
following update rules can be used.
For this update rule, the recursion starts at 𝐳 0 , which is the output vector corresponding to the
initial pattern submitted. The first iteration for k=1 results in 𝑧𝑖1 , where the neuron number i
is random. The other updates are also for random node numbers j, where j ≠ i, until all elements
of the vector 𝐳1 are updated.
Under this update mode, all n neurons of the layer are allowed to change their output
simultaneously.
Example 4.6
W = [0  −1; −1  0],  x = [0, 0]^T,  T = [0, 0]^T
Set the initial output vector as z^0 = [−1, −1]^T.
According to the asynchronous update rule, only one node is considered at a time. Assume that the first node is chosen for update and the second node is considered next:
z^1 = [1, −1]^T, z^2 = [1, −1]^T, z^3 = [1, −1]^T, …
The state z = [1, −1]^T is an equilibrium state of the network. Using different initial outputs, the vectors z = [1, −1]^T and z = [−1, 1]^T are found to be the two equilibria of the system.
Under the synchronous update rule, starting from the same initial output z^0 = [−1, −1]^T:
z^1 = [1, 1]^T, z^2 = [−1, −1]^T, z^3 = [1, 1]^T, …
The synchronous update produces a cycle of two states rather than a single equilibrium state.
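The two update modes of Example 4.6 can be sketched in Python as below. The nodes are visited cyclically rather than at random in the asynchronous case, and the output is kept unchanged when the activation is exactly zero; both choices are assumptions of this sketch.

import numpy as np

W = np.array([[0.0, -1.0], [-1.0, 0.0]])   # symmetric weight matrix with zero diagonal
x = np.zeros(2)                            # external inputs
T = np.zeros(2)                            # thresholds

def tlu(v, old):
    # bipolar hard limiter; keep the previous output when v == 0 (assumed convention)
    return np.where(v > 0, 1.0, np.where(v < 0, -1.0, old))

z = np.array([-1.0, -1.0])                 # initial output z^0

for k in range(4):                         # asynchronous update: one node at a time
    i = k % 2
    z[i] = tlu(W[i] @ z + x[i] - T[i], z[i])
    print("asynchronous", k + 1, z)        # settles at the equilibrium [ 1, -1]

z = np.array([-1.0, -1.0])
for k in range(4):                         # synchronous update: all nodes at once
    z = tlu(W @ z + x - T, z)
    print("synchronous", k + 1, z)         # oscillates between [1, 1] and [-1, -1]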
Example 4.7
References
Brown, M., Harris C. 1994, Neurofuzzy Adaptive Modelling and Control, Prentice
Hall, Hertfordshire.
Hertz, J., Krogh, A., Palmer, R.G. 1991, Introduction to the Theory of Neural
Computing, Addison-Wesley, Redwood City, California.
Kosko, B. 1992, Neural Networks and Fuzzy Systems, Prentice Hall, New Jersey.
Smith, M. 1993, Neural Networks for Statistical Modelling, Van Nostrand Reinhold, New
York.
CHAPTER FIVE
MULTI-LAYER FEEDFORWARD NEURAL NETWORKS
_______________________________________________________
For training patterns which are linearly nonseparable, multi-layer networks (layered
networks) can be used. They can implement arbitrary complex input/output mappings or
decision surfaces separating pattern classes. The most important attribute of a multi-
layered feedforward network is that it can learn a mapping of any complexity using
repeated presentations of the training samples. The trained network often produces
surprising results and generalisations in applications where explicit derivation of
mappings and discovery of relationships is almost impossible.
Assume that two training sets Y1 and Y2 of augmented patterns are available for training.
If no weight vector w exists such that
w^T y > 0 for each y ∈ Y1 and w^T y < 0 for each y ∈ Y2   (5.1)
then the two pattern sets are linearly nonseparable. However, it is possible to map the original pattern space into an image space so that a two-layer network can eventually classify the patterns which are linearly nonseparable in the original pattern space.
Example 5.1
The layered classifier shown in Figure 5.1 is designed to implement the following linearly
nonseparable patterns of XOR function
x1 = [0, 0]^T, x2 = [0, 1]^T, x3 = [1, 1]^T, x4 = [1, 0]^T
The arbitrary selected partitioning is provided by the two decision lines in Figure 5.2
having equations
−2𝑥1 + 𝑥2 − 0.5 = 0
𝑥1 − 𝑥2 − 0.5 = 0
The first layer provides an appropriate mapping of patterns into images in the image
space. The second layer implements the classification of the images rather than of the
original patterns. Note that both input patterns A and D collapse in the image space into
a single image (−1, −1).
An arbitrary decision line providing the desired classification and separating A, D and
B, C in the image space as shown in Figure 5.3 has been selected as
z1+z2+1=0
The input and output values of the network are denoted y and z respectively. Using the vector
notation, the forward pass in the network can be expressed as follows
where the input vector y, desired output vector d, output vector z, and the weight matrix W
are respectively
y = [y1, y2, …, yJ]^T,  d = [d1, d2, …, dK]^T,  z = [z1, z2, …, zK]^T,  W = [w11 ⋯ w1J;  ⋮ ⋱ ⋮;  wK1 ⋯ wKJ]   (5.3)
f(v) = [f(v1), f(v2), …, f(vK)]^T   (5.4)
E = (1/2) Σ_{k=1}^{K} (dk − zk)²   (5.5)
Assume that a gradient descent search is performed to reduce the error E by adjustment of the weights. The weight adjustment can be expressed as follows:
Δw_kj = −η ∂E/∂w_kj = −η (∂E/∂vk)(∂vk/∂w_kj) = −η (∂E/∂vk) yj = η δk yj   (5.6)
δk = −∂E/∂vk = −(∂E/∂zk)(∂zk/∂vk) = (dk − zk) f′(vk)   (5.7)
The updated weights under the delta training rule can be found from
W* = W − η ∂E/∂W = W + η δ y^T   (5.8)
where
δ = [δ1, δ2, …, δK]^T
5.3 Generalised Delta Learning Rule (Error Back Propagation Training)
Consider a two-layer network (or three-node layer network) as shown in Figure 5.5.
Layers with neurons whose outputs are not directly accessible are called hidden layers.
The negative gradient descent for the hidden layer now can be found from
Δw̄_ji = −η ∂E/∂w̄_ji = −η (∂E/∂v̄j)(∂v̄j/∂w̄_ji) = η δ̄j xi   (5.9)
Note that
∂E/∂yj = ∂/∂yj [ (1/2) Σ_{k=1}^{K} (dk − zk)² ] = Σ_{k=1}^{K} (∂Ek/∂zk)(∂zk/∂vk)(∂vk/∂yj)   (5.11)
∂E/∂yj = −Σ_{k=1}^{K} (dk − zk)(∂zk/∂vk) w_kj = −Σ_{k=1}^{K} δk w_kj   (5.12)
or
δ̄j = (∂yj/∂v̄j) Σ_{k=1}^{K} δk w_kj = (Σ_{k=1}^{K} δk w_kj) f′(v̄j),  where δ̄ = [δ̄1, …, δ̄_{J−1}]^T
W̄* = W̄ − η ∂E/∂W̄ = W̄ + η δ̄ x^T,  where dim(δ̄) = (J−1, 1)   (5.14)
Example 5.2
The layered classifier shown in Figure 5.6 is trained to solve the following linearly
nonseparable patterns of XOR function.
x1 = [0, 0]^T, x2 = [0, 1]^T, x3 = [1, 1]^T, x4 = [1, 0]^T
d1 = −1, d2 = 1, d3 = −1, d4 = 1
All continuous perceptrons use the bipolar activation function. Assume that η = 0.1; a set of initial random weights was found to provide correct solutions. The initial weight matrices W^1, W̄^1 and the resulting weight matrices W^f, W̄^f obtained after 250 cycles (1000 steps) are:
W̄^1 = [−6.9938  6.6736  1.5555; −4.2812  3.9127  3.6233],  W̄^2 = [−6.9938  6.6736  1.5422; −4.2812  3.9127  3.6244]
W̄^f = [−6.1974  7.4970  −1.3308; −4.7861  5.2825  3.0159]
Using the final weight matrices and inputting all the patterns x1 to x4, the actual outputs are z = [−0.7784, 0.7678, −0.8743, 0.7498]. Replacing the continuous output perceptron with a discrete output perceptron (e.g. a TLU), the actual outputs become z = [−1, 1, −1, 1].
Since the network from Figure 5.6 is required to function as a classifier with binary
outputs, the continuous perceptrons should be replaced with discrete perceptrons.
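A from-scratch Python sketch of error back propagation for the XOR classifier of Example 5.2 is given below (pattern-mode training, bipolar activations, equations (5.7), (5.8), (5.9) and (5.14)). The initial weights are drawn at random here, so the numerical values will differ from those quoted above.

import numpy as np

def f(v):                                    # bipolar activation function
    return (1 - np.exp(-v)) / (1 + np.exp(-v))

rng = np.random.default_rng(1)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]])   # XOR patterns
D = np.array([-1.0, 1.0, -1.0, 1.0])                              # desired outputs

eta = 0.1
Wbar = rng.uniform(-1, 1, (2, 3))   # hidden layer: 2 neurons, augmented input
W = rng.uniform(-1, 1, (1, 3))      # output layer: 1 neuron, augmented hidden output

for cycle in range(5000):                     # pattern-mode (incremental) training
    for x_pat, target in zip(X, D):
        xa = np.append(x_pat, 1.0)            # augmented input vector
        y = np.append(f(Wbar @ xa), 1.0)      # augmented hidden-layer output
        z = f(W @ y)                          # network output
        delta = (target - z) * 0.5 * (1 - z**2)                   # output error signal, eq. (5.7)
        delta_bar = (W[:, :2].T @ delta) * 0.5 * (1 - y[:2]**2)   # hidden error signal
        W += eta * np.outer(delta, y)                             # eq. (5.8)
        Wbar += eta * np.outer(delta_bar, xa)                     # eq. (5.14)

for x_pat in X:                               # check the trained classifier
    z = f(W @ np.append(f(Wbar @ np.append(x_pat, 1.0)), 1.0))
    print(x_pat, round(float(z[0]), 3))       # a successful run gives values near -1, 1, -1, 1
# If training stalls in a local minimum (cf. Figure 5.8), restart with other random weights.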
E = (1/2) Σ_{k=1}^{K} (dk − zk)²   (5.15)
The accumulative error (cycle error) Ec can be calculated over the error back-propagation
training cycle of a network with P training patterns and K output neurons as:
Ec = (1/2) Σ_{p=1}^{P} Σ_{k=1}^{K} (d_pk − z_pk)² = Σ_{p=1}^{P} Ep   (5.16)
E_rms = √[ (1/(PK)) Σ_{p=1}^{P} Σ_{k=1}^{K} (d_pk − z_pk)² ] = √[ 2Ec / (PK) ]   (5.17)
The essence of the error back-propagation algorithm is the evaluation of the contribution of
each particular weight to the output error. This is often referred to as the problem of credit
assignment. One of the problems in the implementation of the algorithm is that it may produce
only a local minimum of the error function as illustrated in Figure 5.8. For a local minimum
to exist, all the weights must simultaneously be at a value from which a change in either
direction will increase the error. Although the negative gradient descent technique can become stuck in local minima of the error function, in general these minima are not very deep, and it is usually sufficient to escape them by injecting some form of randomness into the training.
The weights of the network to be trained are typically initialised at small random values.
The initialisation strongly affects the ultimate solution. The network may fail to learn the
training set with the error stabilising or even increasing as the learning continues. The
network learning should then be restarted with other random weights. In Figure 5.8, there are 3 starting points (initial weights); only point 1 can meet the training goal (reach Erms,min). Points 2 and 3 are trapped in local minima.
Using the same Example 5.2, where the learning rate is set at η = 0.1, compare the learning performance with different learning rates (η = 0.01, 0.8, 5). A good learning rate was found to be η = 0.8. The cycle errors for the different learning rates are shown in Figure 5.9.
Figure 5.9 The cycle errors for different learning rates (η = 0.01, 0.1, 0.8, 5)
The method of adaptive learning rates is much faster than steepest descent and is also very
dependable. Let there be a different learning rate for each weight in the network.
If the direction in which the error decreases at this weight change is the same as the direction in which it has been decreasing recently, make the learning rate larger. If the direction in which the error currently decreases is the opposite of the recent direction, make the learning rate smaller.
The momentum method may be used to accelerate the convergence of the error back
propagation learning algorithm. The method involves supplementing the current weight
adjustments with a fraction of the most recent weight adjustment. The back-propagation
(steepest descent) algorithm with momentum term is:
Typically, α is chosen between 0.1 and 0.9. The momentum term typically helps to speed up convergence and to achieve an efficient and more reliable learning profile. Note that w^0 = w^1.
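The momentum modification can be sketched as a small helper that supplements the current gradient step with a fraction alpha of the most recent weight adjustment; the gradient is supplied by the caller, and the values eta = 0.8 and alpha = 0.4 simply echo those used in Example 5.3 below.

import numpy as np

def momentum_step(w, prev_dw, grad_E, eta=0.8, alpha=0.4):
    # One weight update of back-propagation with a momentum term:
    # dw(t) = -eta * dE/dw + alpha * dw(t-1), with alpha typically 0.1 ... 0.9.
    dw = -eta * grad_E + alpha * prev_dw
    return w + dw, dw

w = np.array([0.5, -0.3])
prev_dw = np.zeros_like(w)                   # w^0 = w^1: no previous adjustment at the start
for grad in [np.array([0.2, -0.1]), np.array([0.18, -0.12])]:   # hypothetical gradients
    w, prev_dw = momentum_step(w, prev_dw, grad)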
Example 5.3
Using the set of initial random weights as in Example 5.2 and a learning rate of η = 0.8, a good momentum constant was found to be α = 0.4. The initial weight matrices W^1, W̄^1 and the resulting weight matrices W^f, W̄^f obtained after 250 cycles (1000 steps) are:
W̄^f = [−6.3004  7.4248  −1.8632; −5.1667  5.7364  3.1251]
Figure 5.10 The Cycle Error for back-propagation learning with momentum
Using the same example, different momentum constant values (α = 0.1, 0.4, 0.9) are chosen; a good momentum constant was found to be α = 0.4.
Figure 5.11 The cycle error for back-propagation learning with momentum for different momentum constant values (α = 0.1, 0.4, 0.9)
Consider a network with I input nodes, a single hidden layer of J neurons, and an output layer
consisting of K neurons as shown in Figure 5.12. The number of input nodes is simply
determined by the dimension of the input vector to be classified, which usually corresponds
to the number of distinct features of the input pattern.
In the case of planar images, the size of the input vector is usually equal to the number of pixels
in the evaluated image. For example, the characters C, I, T can be represented on a 3 x 3 grid as
shown in Figure 5.13.
Assume that the size of the input vector is 9; three training vectors are required:
x2 =[−1 1 −1 −1 1 −1 −1 1 −1 ]T :Class I
For networks functioning as a classifier, the number of output neurons K can be made equal
to the number of classes. In such cases (local representation), the network would also perform
as a class decoder. Thus, the number of outputs is K.
The size of hidden layer is one of the most important considerations when solving actual
problems using multi-layer feed-forward networks. Assume that the n-dimensional non-
augmented input space is linearly separable into M disjoint regions with boundaries being
parts of hyperplanes. Each of the M regions can be labelled as belonging to one of the R
classes, where R ≤ M.
Figure 5.14 shows an example separation for n=9, R=3 and M=7. Intuitively, the number of
separation regions M should be the lower bound on the size P of the training set (P ≥ M).
There exists a relationship between M, J* and n. The maximum number of regions linearly
separable using J* hidden neurons in n-dimensional input space is given by:
(5.19)
(5.20)
For input vectors that are large compared to the number of hidden nodes J*, i.e. n ≥ J*, the number of hidden nodes can be calculated from
(5.22)
Example 5.4
Design and implement the classification of the three printed characters C, I, and T, as shown in Figure 5.13, using a single hidden layer network. The three input vectors and the target vectors
in the training set are chosen as:
x2 =[−1 1 −1 −1 1 −1 −1 1 −1 ]T :Class I
d1 = [1, −1, −1]^T, d2 = [−1, 1, −1]^T, d3 = [−1, −1, 1]^T
As shown in Figure 5.13, there are 3 classes (R = 3), so the number of output neurons is 3 (K = 3), and the number of separation regions is 7 (M = 7). The number of hidden nodes can be calculated from the equation M = 2^J*, giving J* = ⌈log2 7⌉ = 3. Note that due to the necessary augmentation of the inputs and of the hidden layer by one fixed input, the trained network should have 10 input nodes (I = 10), 4 hidden nodes (J = 4) and 3 output nodes (K = 3). Therefore dim(W) = (K, J) = (3, 4) and dim(W̄) = (J − 1, I) = (3, 10).
In this example, all continuous perceptrons use the bipolar activation function. Using a set of initial random weights and a learning constant of η = 0.8, the initial weight matrices W^1, W̄^1 and the final weight matrices W^f, W̄^f obtained after 60 steps are:
In pattern mode training, the output weight matrix W and the hidden weight matrix W̄ are
updated after each pattern is presented. In batch mode training, increments are accumulated
and updating takes place at the end of each cycle (epoch) after all patterns have been
presented. Essentially, pattern mode training performs stochastic gradient or incremental
gradient search whereas batch mode training performs average gradient descent in the
weight space.
Example 5.5
The layered classifier shown in Figure 5.6 is trained in batch mode to solve the following linearly nonseparable patterns of the XOR function
x1 = [0, 0]^T, x2 = [0, 1]^T, x3 = [1, 1]^T, x4 = [1, 0]^T
d1 = −1, d2 = 1, d3 = −1, d4 = 1
Using the set of initial random weights as in Example 5.2, a good learning constant was found to be η = 0.8. The initial weight matrices are the same as in Example 5.2, and the resulting (final) matrices W^f and W̄^f obtained after 50 cycles (200 steps) are:
W^f = [−3.1856  3.4354  −2.7244]
W̄^f = [−6.1454  7.5729  −1.6690; −4.8755  5.4606  2.9841]
Figure 5.16 shows the cycle error for Example 5.5.
In a typical situation, the mean-square error decreases with an increasing number of cycles
(or epochs) during training. With good generalisation as the goal, it is difficult to determine
when it is best to stop training if we only look at the learning curve. It is possible for the
network to end up overfitting the training data if the training session is not stopped at the right
point.
We may identify the onset of overfitting through the use of cross-validation in which the
training data are split into an estimation subset and a validation subset. The estimation subset
is used to train the network in the usual way, except that the training session is stopped
periodically (every so many cycles), and the network is tested on the validation subset after
each period of training.
References
Brown, M., Harris C. 1994, Neurofuzzy Adaptive Modelling and Control, Prentice
Hall, Hertfordshire.
Hertz, J., Krogh, A., Palmer, R.G. 1991, Introduction to the Theory of Neural
Computing, Addison-Wesley, Redwood City, California.
Kosko, B. 1992, Neural Networks and Fuzzy Systems, Prentice Hall, New Jersey.
Smith, M. 1993, Neural Networks for Statistical Modelling, Van Nostrand Reinhold, New
York.
Trieu, H.T., Nguyen, H.T., Willey, K. 2008, ‘Advanced Obstacle Avoidance for a Laser-
based Wheelchair using Optimised Bayesian Neural Networks’, 30th Annual International
Conference of the IEEE Engineering in Medicine and Biology Society,
20-24 August 2008, Vancouver, Canada, pp. 3463-3466.
CHAPTER SIX
INTRODUCTION TO CONVOLUTIONAL NEURAL
NETWORKS
_______________________________________________________
As shown in the previous chapter, neural networks receive an input (a single vector), and
transform it through a series of hidden layers. Each hidden layer is made up of a set of
neurons, where each neuron is fully connected to all neurons in the previous layer, and where
neurons in a single layer function completely independently and do not share any
connections. The last fully-connected layer is called the “output layer” and in classification
settings it represents the class scores.
Regular neural networks, however, do not scale well to full images. For example, for images of size only 32×32×3 (32 wide, 32 high, 3 color channels), a single fully-connected neuron
in a first hidden layer of a regular neural network would have 32*32*3 = 3072 weights. This
amount still seems manageable, but clearly this fully-connected structure does not scale to
larger images. For example, an image of more respectable size, e.g. 200×200×3, would lead
to neurons that have 200*200*3 = 120,000 weights. Moreover, we would almost certainly
want to have several such neurons, so the parameters would add up quickly! Clearly, this
full connectivity is wasteful, and the huge number of parameters would quickly lead to
overfitting. On the other hand, the spatial and structural correlations of the image pixels are
broken if an image is flattened to a single vector as input. Therefore, convolutional neural networks (CNNs) were proposed.
Convolutional neural networks or CNNs, are a specialized kind of neural network for
processing data that has a known, grid-like topology. Convolutional Neural Networks are
very similar to ordinary neural networks from the previous chapter: they are made up of
neurons that have learnable weights and biases. Each neuron receives some inputs, performs
a dot product and optionally follows it with a non-linearity. The whole network still
expresses a single differentiable score function: from the raw image pixels on one end to
class scores at the other. And they still have a loss function (e.g. Softmax) on the last (fully-
connected) layer and all the tips/tricks we developed for learning regular neural networks
still apply.
CNN architectures make the explicit assumption that the inputs are images, which allows us
to encode certain properties into the architecture. These then make the forward function
more efficient to implement and vastly reduce the amount of parameters in the network.
Convolutional networks have been tremendously successful in practical applications.
A typical CNN consists of three main layers: the convolutional layer, the pooling layer and the fully-connected layer. Figure 6.1 shows a classical CNN model, LeNet5. In this figure,
C1, C3 and C5 are three convolution layers which compute the output of neurons that are
connected to local regions. S2 and S4 are the pooling/subsampling layers. F6 is a fully-connected layer that computes the class scores. Finally, the output is obtained from the classifier. In this chapter, we mainly introduce the Softmax classifier. We now describe the
individual layers and the details of their hyperparameters and their connectivities.
The convolutional layer is the core building block of a convolutional network that does most
of the computational heavy lifting. It has three main characteristics: local connectivity,
spatial arrangement and parameter sharing.
Figure 6.2 Local Connectivity
As shown in the Figure 6.2 (b), the neurons from the neural network chapter remain
unchanged: They still compute a dot product of their weights with the input followed by a
non-linearity, but their connectivity is now restricted to be local spatially.
We have explained the connectivity of each neuron in the convolutional layer to the input
volume, but we haven’t yet discussed how many neurons there are in the output volume or
how they are arranged. Three hyperparameters control the size of the output volume: the
depth, stride and zero-padding. We discuss these next:
Depth: the depth of the output volume is a hyperparameter: it corresponds to the number of
filters we would like to use, each learning to look for something different in the input. For
example, if the first convolutional layer takes as input the raw image, then different neurons
along the depth dimension may activate in presence of various oriented edges, or blobs of
colour. We will refer to a set of neurons that are all looking at the same region of the input
as a depth column (some people also prefer the term fibre).
Stride: we must specify the stride with which we slide the filter. When the stride is 1 then
we move the filters one pixel at a time. When the stride is 2 (or uncommonly 3 or more,
though this is rare in practice) then the filters jump 2 pixels at a time as we slide them around.
This will produce smaller output volumes spatially.
Zero-padding: As we will soon see, sometimes it will be convenient to pad the input volume
with zeros around the border. The size of this zero-padding is a hyperparameter. The nice
feature of zero padding is that it will allow us to control the spatial size of the output volumes
(most commonly as we’ll see soon we will use it to exactly preserve the spatial size of the
input volume so the input and output width and height are the same).
We can compute the spatial size of the output volume as a function of the input volume size
(W), the receptive field size of the convolution layer neurons (F), the stride with which they
are applied (S), and the amount of zero padding used (P) on the border. You can convince
yourself that the correct formula for calculating the size of output volume Nout is given by:
N_out = (W − F + 2P)/S + 1   (6.1)
For example, for a 7×7 input and a 3×3 filter with stride S=1 and zero padding P=0, we
would get a 5×5 output. With stride S=2 we would get a 3×3 output. The process is shown
in Figure 6.3.
In this example there is only one spatial dimension (x-axis), one neuron with a receptive field
size of F = 3, the input size is W = 5, and there is zero padding of P = 1. In Figure 6.3(a): The
neuron strides across the input in stride of S = 1, giving output of size (5 - 3 + 2)/1+1 = 5. In
Figure 6.3(b): The neuron uses stride of S = 2, giving output of size (5 - 3 + 2)/2+1 = 3.
Notice that stride S = 3 could not be used since it wouldn't fit neatly across the volume. In
terms of the equation (6.1), this can be determined since (5 - 3 + 2) = 4 is not divisible by 3.
The neuron weights are in this example [1, 0, −1] (shown in Fig. 6.3(c)), and its bias is zero.
These weights are shared across all yellow neurons (top layer) in Figure 6.3.
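Equation (6.1) and the "does not fit" check can be wrapped in a small helper; the function name conv_output_size is illustrative.

def conv_output_size(W, F, S, P):
    # Spatial output size N_out = (W - F + 2P)/S + 1, eq. (6.1).
    # Raises an error when the neurons do not fit neatly across the input.
    if (W - F + 2 * P) % S != 0:
        raise ValueError("invalid setting: (W - F + 2P) is not divisible by S")
    return (W - F + 2 * P) // S + 1

print(conv_output_size(7, 3, 1, 0))   # 5
print(conv_output_size(7, 3, 2, 0))   # 3
print(conv_output_size(5, 3, 1, 1))   # 5: zero padding preserves the spatial size
# conv_output_size(10, 3, 2, 0) would raise, since (10 - 3 + 0) = 7 is not divisible by 2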
Use of zero-padding: In the example of Figure 6.3(a), the input dimension was 5 and the
output dimension was equal (also 5). This worked out so because our receptive fields were 3
and we used zero padding of 1. If there was no zero-padding used, then the output volume
would have had spatial dimension of only 3, because that is how many neurons would have
“fit” across the original input. In general, setting zero padding to be P=(F−1)/2 when the stride
is S=1 ensures that the input volume and output volume will have the same size spatially.
Constraints on strides: Note again that the spatial arrangement hyperparameters have mutual
constraints. For example, when the input has size W=10, no zero-padding is used P=0, and
the filter size is F=3, then it would be impossible to use stride S=2, since
(W−F+2P)/S+1=(10−3+0)/2+1=4.5, i.e. not an integer, indicating that the neurons don’t “fit”
neatly and symmetrically across the input. Therefore, this setting of the hyperparameters is
considered to be invalid, and a CNN library could throw an exception or zero pad the rest to
make it fit, or crop the input to make it fit, or something. Sizing the CNN appropriately so
that all the dimensions “work out” can be a real headache, which the use of zero-padding and
some design guidelines will significantly alleviate.
Parameter sharing refers to using the same parameter for more than one function in a model.
In a traditional neural net, each element of the weight matrix is used exactly once when
computing the output of a layer. It is multiplied by one element of the input and then never
revisited. As a synonym for parameter sharing, one can say that a network has tied weights,
because the value of the weight applied to one input is tied to the value of a weight applied
elsewhere. In a convolutional neural net, each member of the kernel is used at every position
of the input (except perhaps some of the boundary pixels, depending on the design decisions
regarding the boundary). The parameter sharing used by the convolution operation means
that rather than learning a separate set of parameters for every location, we learn only one
set.
It turns out that we can dramatically reduce the number of parameters by making one
reasonable assumption: That if one feature is useful to compute at some spatial position (x, y),
then it should also be useful to compute at a different position (x2, y2). In other words, denoting
a single 2-dimensional slice of depth as a depth slice (e.g. a volume of size [55×55×96] has
96 depth slices, each of size [55×55]), we are going to constrain the neurons in each depth
slice to use the same weights and bias. With this parameter sharing scheme, the first
convolutional layer in our example would now have only 96 unique set of weights (one for
each depth slice), for a total of 96*11*11*3 = 34,848 unique weights, or 34,944 parameters
(+96 biases). Alternatively, all 55*55 neurons in each depth slice will now be using the same
parameters. In practice during backpropagation, every neuron in the volume will compute the
gradient for its weights, but these gradients will be added up across each depth slice and only
update a single set of weights per slice.
Notice that if all neurons in a single depth slice are using the same weight vector, then the
forward pass of the conv layer can in each depth slice be computed as a convolution of the
neuron’s weights with the input volume (Hence the name: Convolutional Layer). This is why
it is common to refer to the sets of weights as a filter (or a kernel), that is convolved with the
input.
Note that sometimes the parameter sharing assumption may not make sense. This is especially
the case when the input images to a CNN have some specific centred structure, where we
should expect, for example, that completely different features should be learned on one side
of the image than another. One practical example is when the input are faces that have been
centred in the image. You might expect that different eye-specific or hair-specific features
could (and should) be learned in different spatial locations. In that case it is common to relax
the parameter sharing scheme, and instead simply call the layer a Locally-Connected Layer.
❖ In the output volume, the d-th depth slice (of size W2×H2) is the result of performing
a valid convolution of the d-th filter over the input volume with a stride of S, and
then offset by d-th bias.
A common setting of the hyperparameters is F=3, S=1, P=1; there are common conventions and rules of thumb that motivate these hyperparameters.
Example 6.1
This example shows how to get the value in the feature map based on the receptive field in
the input and the weights (convolution filter). We can see that the destination pixel =
∑9𝑖=1 𝑤𝑖 𝑥𝑖 =((−1×3)+ (0×0)+ (1×1)+ (−2×2)+ (0×6)+ (2×2)+ (−1×2)+ (0×4)+ (1×1))= −3.
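The arithmetic of Example 6.1 can be reproduced with an element-wise multiply-and-sum of the 3x3 receptive field and the 3x3 filter; the two arrays below are laid out directly from the terms of the sum above.

import numpy as np

receptive_field = np.array([[3, 0, 1],    # 3x3 patch of the input covered by the filter
                            [2, 6, 2],
                            [2, 4, 1]])
kernel = np.array([[-1, 0, 1],            # 3x3 convolution filter (the weights w_i)
                   [-2, 0, 2],
                   [-1, 0, 1]])
destination_pixel = np.sum(receptive_field * kernel)
print(destination_pixel)                  # -3, as in Example 6.1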
Example 6.2
For an input image with size 5×5×3 (3 colour channels), since 3D volumes are hard to
visualize, all the volumes are visualized with each depth slice stacked in rows. The input
volume is of size W1=5, H1=5, D1=3, and the convolutional layer parameters are K=2, F=3,
S=2, P=1. That is, we have 2 filters of size 3×3 with a stride of 2. Therefore, the output volume
size has spatial size (5 − 3 + 2)/2 + 1 = 3. Moreover, notice that a padding of P=1 is applied
to the input volume, making the outer border of the input volume zero. The visualization
below iterates over the output activations, and shows that each element is computed by
elementwise multiplying the highlighted input with the filter, summing it up, and then
offsetting the result by the bias. Now, the size of input volume with zero-padding (P=1) is
7×7×3 and the three input images (with pad 1) are:
A typical layer of a convolutional network consists of three stages. In the first stage, the layer
performs several convolutions in parallel to produce a set of linear activations. In the second
stage, each linear activation is run through a nonlinear activation function, such as the rectified
linear activation function. This stage is sometimes called the detector stage. In the third stage,
we use a pooling function to modify the output of the layer further.
A pooling function replaces the output of the net at a certain location with a summary statistic
of the nearby outputs. For example, the max pooling operation reports the maximum output
within a rectangular neighbourhood. Other popular pooling functions include the average of
a rectangular neighbourhood, the L2 norm of a rectangular neighbourhood, or a weighted
average based on the distance from the central pixel. Figure 6.4 shows an example of the process of max pooling. The pooling layer downsamples the volume spatially, independently in
each depth slice of the input volume. In Figure 6.4(a), the input volume of size [224×224×
64] is pooled with filter size 2, stride 2 into output volume of size [112×112×64]. Notice that
the volume depth is preserved. In Figure 6.4(b), the most common downsampling operation
is max, giving rise to max pooling, here shown with a stride of 2. That is, each max is taken
over 4 numbers (little 2×2 square).
Figure 6.4 Illustration of max pooling.
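Max pooling over an (H, W, D) volume, applied independently to each depth slice as in Figure 6.4, can be sketched as follows (filter size 2, stride 2).

import numpy as np

def max_pool(volume, size=2, stride=2):
    # Downsample each depth slice of an (H, W, D) volume by taking the max of each window.
    H, W, D = volume.shape
    H_out, W_out = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((H_out, W_out, D))
    for i in range(H_out):
        for j in range(W_out):
            patch = volume[i * stride:i * stride + size, j * stride:j * stride + size, :]
            out[i, j, :] = patch.max(axis=(0, 1))   # max over the little size x size square
    return out

x = np.random.rand(224, 224, 64)
print(max_pool(x).shape)   # (112, 112, 64): spatial size halved, depth preserved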
In all cases, pooling helps to make the representation become approximately invariant to small
translations of the input. Invariance to translation means that if we translate the input by a
small amount, the values of most of the pooled outputs do not change. See Figure 6.5 for an
example of how this works. Figure 6.5(a) shows a view of the middle of the output of a
convolutional layer. The bottom row shows outputs of the nonlinearity. The top row shows
the outputs of max pooling, with a stride of one pixel between pooling regions and a pooling
region width of three pixels. Figure 6.5(b) shows a view of the same network, after the input
has been shifted to the right by one pixel. Every value in the bottom row has changed, but
only half of the values in the top row have changed, because the max pooling units are only
sensitive to the maximum value in the neighbourhood, not its exact location.
Invariance to local translation can be a very useful property if we care more about whether
some feature is present than exactly where it is. For example, when determining whether an
image contains a face, we need not know the location of the eyes with pixel-perfect accuracy,
we just need to know that there is an eye on the left side of the face and an eye on the right
side of the face. In other contexts, it is more important to preserve the location of a feature.
For example, if we want to find a corner defined by two edges meeting at a specific orientation,
we need to preserve the location of the edges well enough to test whether they meet.
Figure 6.5 Max pooling introduces invariance.
The use of pooling can be viewed as adding an infinitely strong prior that the function the
layer learns must be invariant to small translations. When this assumption is correct, it can
greatly improve the statistical efficiency of the network.
Pooling over spatial regions produces invariance to translation, but if we pool over the outputs
of separately parametrized convolutions, the features can learn which transformations to
become invariant to (see Figure. 6.6). A pooling unit that pools over multiple features that are
learned with separate parameters can learn to be invariant to transformations of the input.
Here we show how a set of three learned filters and a max pooling unit can learn to become
invariant to rotation. All three filters are intended to detect a hand-written “5”. Each filter
attempts to match a slightly different orientation of the “5”. When a “5” appears in the input,
the corresponding filter will match it and cause a large activation in a detector unit. The max
pooling unit then has a large activation regardless of which pooling unit was activated. We
show here how the network processes two different inputs, resulting in two different detector
units being activated. The effect on the pooling unit is roughly the same either way.
Because pooling summarizes the responses over a whole neighbourhood, it is possible to use
fewer pooling units than detector units, by reporting summary statistics for pooling regions
spaced k pixels apart rather than 1 pixel apart. This improves the computational efficiency of
the network because the next layer has roughly k times fewer inputs to process. When the
number of parameters in the next layer is a function of its input size (such as when the next
layer is fully connected and based on matrix multiplication) this reduction in the input size
can also result in improved statistical efficiency and reduced memory requirements for storing
the parameters. For many tasks, pooling is essential for handling inputs of varying size. For
example, if we want to classify images of variable size, the input to the classification layer
must have a fixed size. This is usually accomplished by varying the size of an offset between
pooling regions so that the classification layer always receives the same number of summary
statistics regardless of the input size. For example, the final pooling layer of the network may
be defined to output four sets of summary statistics, one for each quadrant of an image,
regardless of the image size.
Neurons in a fully connected layer have full connections to all activations in the previous
layer, as seen in traditional neural networks (see Chapters 4-5). It can be viewed as the final learning phase, which maps the extracted visual features to the desired outputs. The output of the fully-connected layer is a vector, which is then passed through softmax to represent the confidence of
classification.
6.2.4 Softmax
Softmax is a special kind of activation layer, usually at the end of fully-connected layer
outputs. It can be viewed as a fancy normalizer (a.k.a. Normalized exponential function).
It produces a discrete probability distribution vector and is very convenient when
combined with cross-entropy loss.
Given sample vector input x and weight vectors {wj}, the predicted probability of y = j
can be calculated by:
P(y = j | x) = exp(wj^T x) / Σ_{k=1}^{K} exp(wk^T x)   (6.2)
Figure 6.7 shows an example of the process of the Softmax classifier: after calculating and comparing the probability of each colour, the green colour has the highest probability value compared with the other colours, and thus the final output of the classifier is green.
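Equation (6.2) can be sketched as a short function; subtracting the maximum score before exponentiating is a common numerical-stability trick and is not part of the equation itself. The class scores in the usage lines are hypothetical.

import numpy as np

def softmax(scores):
    # Normalized exponential of eq. (6.2); scores[j] plays the role of w_j^T x.
    shifted = scores - np.max(scores)      # for numerical stability only
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

scores = np.array([2.0, 1.0, 0.1])         # hypothetical class scores
probs = softmax(scores)
print(probs, probs.argmax())               # the highest-probability class is taken as the output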
We have previously introduced the gradient descent algorithm that follows the gradient of an
entire training set downhill. This may be accelerated considerably by using stochastic gradient
descent to follow the gradient of randomly selected minibatches downhill when the data size
is large.
Stochastic gradient descent (SGD) and its variants are probably the most used optimization
algorithms for machine learning in general and for deep learning in particular. It is possible
to obtain an unbiased estimate of the gradient by taking the average gradient on a minibatch
of m examples. Algorithm 6.1 shows how to follow this estimate of the gradient downhill.
Apply update: θ ← θ − ϵĝ
end while
A crucial parameter for the SGD algorithm is the learning rate. Previously, we have described
SGD as using a fixed learning rate ϵ. In practice, it is necessary to gradually decrease the
learning rate over time, so we now denote the learning rate at iteration k as ϵk.
The most important property of SGD and related minibatch or online gradient based
optimization is that computation time per update does not grow with the number of training
examples. This allows convergence even when the number of training examples becomes very
large. For a large enough dataset, SGD may converge to within some fixed tolerance of its
final test set error before it has processed the entire training set.
6.3.2 Momentum
While stochastic gradient descent remains a very popular optimization strategy, learning with
it can sometimes be slow. The method of momentum is designed to accelerate learning,
especially in the face of high curvature, small but consistent gradients, or noisy gradients. The
momentum algorithm accumulates an exponentially decaying moving average of past
gradients and continues to move in their direction.
Formally, the momentum algorithm introduces a variable v that plays the role of velocity—it
is the direction and speed at which the parameters move through parameter space. The
velocity is set to an exponentially decaying average of the negative gradient. The name
momentum derives from a physical analogy, in which the negative gradient is a force moving
a particle through parameter space, according to Newton’s laws of motion. Momentum in
physics is mass times velocity. In the momentum learning algorithm, we assume unit mass,
so the velocity vector v may also be regarded as the momentum of the particle. A
hyperparameter α∈[0 ,1) determines how quickly the contributions of previous gradients
exponentially decay. The update rule is given by:
v ← αv − ϵ∇θ( (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i)) )   (6.3)
θ ← θ + v   (6.4)
The velocity v accumulates the gradient elements ∇θ( (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i)) ). The
larger α is relative to ϵ, the more previous gradients affect the current direction. The SGD
algorithm with momentum is given in Algorithm 6.2.
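The velocity update of equations (6.3) and (6.4) can be sketched as below; the minibatch gradient is supplied by the caller, and the quadratic toy loss in the usage lines is only illustrative.

import numpy as np

def sgd_momentum_step(theta, v, grad, epsilon=0.01, alpha=0.9):
    # grad is the average minibatch gradient of the loss with respect to theta.
    v = alpha * v - epsilon * grad     # eq. (6.3): exponentially decaying velocity
    theta = theta + v                  # eq. (6.4)
    return theta, v

theta, v = np.zeros(3), np.zeros(3)
for _ in range(100):
    grad = 2 * (theta - np.array([1.0, -2.0, 0.5]))   # gradient of a toy quadratic loss
    theta, v = sgd_momentum_step(theta, v, grad)
print(theta)   # moves toward the minimiser [1, -2, 0.5]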
Previously, the size of the step was simply the norm of the gradient multiplied by the
learning rate. Now, the size of the step depends on how large and how aligned a sequence
of gradients are. The step size is largest when many successive gradients point in exactly
the same direction. If the momentum algorithm always observes gradient g, then it will
accelerate in the direction of −g, until reaching a terminal velocity where the size of each
step is ϵ‖g‖ / (1 − α).
It is thus helpful to think of the momentum hyperparameter in terms of 1/(1 − α). For example, α = 0.9 corresponds to multiplying the maximum speed by 10 relative to the gradient descent algorithm. Common values of α used in practice include 0.5, 0.9, and 0.99. Like the learning rate, α may also be adapted over time. Typically, it begins with a small value and is later raised. It is less important to adapt α over time than to shrink ϵ over time.
Some heuristics are available for choosing the initial scale of the weights. One heuristic
is to initialize the weights of a fully connected layer with m inputs and n outputs by
sampling each weight from U(−1/√m, 1/√m), while Glorot and Bengio (2010) suggest using
W_ij ~ U(−√6/√(m + n), √6/√(m + n))   (6.5)
This latter heuristic is designed to compromise between the goal of initializing all layers
to have the same activation variance and the goal of initializing all layers to have the same
gradient variance. The formula is derived using the assumption that the network consists
only of a chain of matrix multiplications, with no nonlinearities. Real neural networks
obviously violate this assumption, but many strategies designed for the linear model
perform reasonably well on its nonlinear counterparts.
Neural network researchers have long realized that the learning rate is reliably one of the hyperparameters that is the most difficult to set, because it has a significant impact on model performance: the cost is often highly sensitive to some directions in parameter space and insensitive to others. The momentum algorithm can mitigate these
issues somewhat but does so at the expense of introducing another hyperparameter. In the
face of this, it is natural to ask if there is another way. If we believe that the directions of
sensitivity are somewhat axis-aligned, it can make sense to use a separate learning rate for
each parameter, and automatically adapt these learning rates throughout the course of
learning.
6.3.4.1 AdaGrad
The AdaGrad algorithm, shown in Algorithm 6.3, individually adapts the learning rates of
all model parameters by scaling them inversely proportional to the square root of the sum
of all of their historical squared values. The parameters with the largest partial derivative
of the loss have a correspondingly rapid decrease in their learning rate, while parameters
with small partial derivatives have a relatively small decrease in their learning rate. The
net effect is greater progress in the more gently sloped directions of parameter space.
In the context of convex optimization, the AdaGrad algorithm enjoys some desirable
theoretical properties. However, empirically it has been found that—for training deep
neural network models—the accumulation of squared gradients from the beginning of
training can result in a premature and excessive decrease in the effective learning rate.
AdaGrad performs well for some but not all deep learning models.
Compute update: Δθ ← −(ϵ / (δ + √r)) ⊙ g  (operations applied element-wise)
Apply update: 𝜽 ← 𝜽 + ∆𝜽
end while
6.3.4.2 RMSProp
The RMSProp algorithm modifies AdaGrad to perform better in the non-convex setting
by changing the gradient accumulation into an exponentially weighted moving average.
AdaGrad is designed to converge rapidly when applied to a convex function. When
applied to a non-convex function to train a neural network, the learning trajectory may
pass through many different structures and eventually arrive at a region that is a locally
convex bowl. AdaGrad shrinks the learning rate according to the entire history of the
squared gradient and may have made the learning rate too small before arriving at such a
convex structure. RMSProp uses an exponentially decaying average to discard history
from the extreme past so that it can converge rapidly after finding a convex bowl, as if it
were an instance of the AdaGrad algorithm initialized within that bowl.
RMSProp is shown in its standard form in Algorithm 6.4. Compared to AdaGrad, the use
of the moving average introduces a new hyperparameter, ρ, that controls the length scale
of the moving average.
Apply update: 𝜽 ← 𝜽 + ∆𝜽
end while
6.3.4.3 Adam
Adam is yet another adaptive learning rate optimization algorithm and is presented in
Algorithm 6.5. The name “Adam” derives from the phrase “adaptive moments.” In the
context of the earlier algorithms, it is perhaps best seen as a variant on the combination of
RMSProp and momentum with a few important distinctions. First, in Adam, momentum
is incorporated directly as an estimate of the first order moment (with exponential
weighting) of the gradient. The most straightforward way to add momentum to RMSProp
is to apply momentum to the rescaled gradients. The use of momentum in combination
with rescaling does not have a clear theoretical motivation. Second, Adam includes bias
corrections to the estimates of both the first-order moments (the momentum term) and the
(uncentered) second-order moments to account for their initialization at the origin (see
Algorithm 6.5). RMSProp also incorporates an estimate of the (uncentered) second-order
moment, however it lacks the correction factor. Thus, unlike in Adam, the RMSProp
second-order moment estimate may have high bias early in training. Adam is generally
regarded as being fairly robust to the choice of hyperparameters, though the learning rate
sometimes needs to be changed from the suggested default.
At this point, a natural question is: which algorithm should one choose? Unfortunately,
there is currently no consensus on this point. Currently, the most popular optimization
algorithms actively in use include SGD, SGD with momentum, RMSProp, RMSProp with
momentum, AdaDelta and Adam. The choice of which algorithm to use, at this point,
seems to depend largely on the user’s familiarity with the algorithm (for ease of
hyperparameter tuning).
t←t+1
Update biased first moment estimate: 𝒔 ← 𝜌1 𝒔 + (1 − 𝜌1 )𝒈
Update biased second moment estimate: 𝒓 ← 𝜌2 𝒓 + (1 − 𝜌2 )𝒈 ⊙ 𝒈
Correct bias in first moment: ŝ ← s / (1 − ρ1^t)
Correct bias in second moment: r̂ ← r / (1 − ρ2^t)
Compute update: Δθ = −ϵ ŝ / (δ + √r̂)  (operations applied element-wise)
Apply update: 𝜽 ← 𝜽 + ∆𝜽
end while
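The loop body of Algorithm 6.5 can be written as a small function; the constants rho1 = 0.9, rho2 = 0.999 and delta = 1e-8 follow commonly suggested defaults and are assumptions of this sketch.

import numpy as np

def adam_step(theta, s, r, t, grad, epsilon=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    # One Adam update: s, r are the first/second moment estimates, t the step counter.
    t += 1
    s = rho1 * s + (1 - rho1) * grad                 # biased first moment estimate
    r = rho2 * r + (1 - rho2) * grad * grad          # biased second moment estimate
    s_hat = s / (1 - rho1 ** t)                      # bias-corrected first moment
    r_hat = r / (1 - rho2 ** t)                      # bias-corrected second moment
    theta = theta - epsilon * s_hat / (delta + np.sqrt(r_hat))   # element-wise update
    return theta, s, r, t

theta, s, r, t = np.zeros(2), np.zeros(2), np.zeros(2), 0
for _ in range(1000):
    grad = 2 * (theta - np.array([3.0, -1.0]))       # gradient of a toy quadratic loss
    theta, s, r, t = adam_step(theta, s, r, t, grad, epsilon=0.1)   # larger step for the toy problem
print(theta)   # approaches the minimiser [3, -1]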
Batch normalization (BN) makes networks robust to bad initialization of weights and is usually inserted right before activation layers. It is able to reduce internal covariate shift by normalizing and scaling inputs. The scale and shift parameters are trainable to avoid losing the stability of the network. Algorithm 6.6 shows the BN transform, which is applied to activation x over a mini-batch.
x̂i ← (xi − μ_B) / √(σ_B² + ϵ)    // normalise
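The forward BN transform over a mini-batch of activations, with trainable scale gamma and shift beta, can be sketched as follows.

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # Batch-normalise activations x of shape (batch, features).
    mu = x.mean(axis=0)                        # mini-batch mean
    var = x.var(axis=0)                        # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)      # normalise
    return gamma * x_hat + beta                # scale and shift (trainable parameters)

x = np.random.randn(32, 4) * 5 + 3             # a mini-batch of shifted, scaled activations
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # approximately 0 and 1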
References
Bottou, L. (1998). Online algorithms and stochastic approximations. In D. Saad, editor, Online
Learning in Neural Networks. Cambridge University Press, Cambridge, UK.
Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning
and stochastic optimization. Journal of Machine Learning Research.
Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward
neural networks. In AISTATS’2010.
Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a).
Maxout networks. In S. Dasgupta and D. McAllester, editors, ICML’13, pages 1319– 1327.
Hinton, G. (2012). Neural networks for machine learning. Coursera, video lectures.
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by
reducing internal covariate shift.
Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980.
Zhou, Y. and Chellappa, R. (1988). Computation of optical flow using a neural network. In
Neural Networks, 1988., IEEE International Conference on, pages 71–78. IEEE.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT press.
Johnson, J., Karpathy, A., & Fei-Fei, L. (2016). Cs231n: Convolutional neural networks for
visual recognition. Retrieved from: http://cs231n.github.io.
CHAPTER SEVEN
GENETIC ALGORITHMS
_______________________________________________________
To tackle this problem, some global search evolutionary algorithms (EAs), such as the genetic
algorithm (GA), are employed for searching in a large, complex, non-differentiable and
multimodal domain. Recently, neural or neural-fuzzy networks trained by GA are reported .
The same GA can be used to train many different networks regardless of whether they are
feed-forward, recurrent, or of other structure types. This generally saves a lot of human efforts
in developing training algorithms for different types of networks.
The process of the genetic algorithm is shown in Fig. 7.1. The description is in the following
section (Section 7.1).
Procedure simple GA
begin
end
Assume that we wish to construct a genetic algorithm to solve the above problem, i.e., to
maximise the function f. Let us discuss the major components of such a GA in turn.
7.2.1 Representation
We use a binary vector as a chromosome to represent real values of the variable x. The length
of the vector depends on the required precision, which, in this example, is six places after the
decimal point.
The domain of the variable x has length 3; the precision requirement implies that the range [−1..2] should be divided into at least 3×1,000,000 equal size ranges. This means that 22 bits are required for the binary vector (chromosome): 2^21 < 3,000,000 ≤ 2^22.
The mapping from a binary string b21 b20 … b0 into a real number x from the range [−1..2] is straightforward and is completed in two steps:
• convert the binary string b21 b20 … b0 from base 2 to base 10: (b21 b20 … b0)2 = (Σ_{i=0}^{21} bi·2^i)10 = x′,
• find the corresponding real number x: x = −1.0 + x′ × 3/(2^22 − 1),
where −1.0 is the left boundary of the domain and 3 is the length of the domain.
For example, the chromosome
(1000101110110101000111)
represents the number 0.637197, since
x′ = (1000101110110101000111)2 = 2288967
and x = −1.0 + 2288967 × 3/(2^22 − 1) = 0.637197.
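The two-step decoding described above can be sketched as a short function; it reproduces x = 0.637197 for the example chromosome.

def decode(chromosome, a=-1.0, b=2.0):
    # Map a binary string onto a real number in [a, b] (here 22 bits for the range [-1, 2]).
    m = len(chromosome)
    x_prime = int(chromosome, 2)                 # step 1: base 2 -> base 10
    return a + x_prime * (b - a) / (2 ** m - 1)  # step 2: scale into the domain

v1 = "1000101110110101000111"
print(int(v1, 2), round(decode(v1), 6))          # 2288967  0.637197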
v1 = (1000101110110101000111),
v2 = (0000001110000000010000),
v3 = (1110000000111111000101),
correspond to values x1=0.637197, x2 = −0.958973, and x3 = 1.627888, respectively.
Consequently, the evaluation function would rate them as follows:
Clearly, the chromosome v3 is the best of the three chromosomes, since its evaluation returns
the highest value.
During the alteration phase of the genetic algorithm we would use two classical genetic
operators: crossover and mutation.
Crossover Operator
Let us illustrate the crossover operator on chromosomes v2 and v3. Assume that the crossover
point was (randomly) selected after the 5th gene:
v2 = (00000|01110000000010000),
v3 = (11100|00000111111000101),
The two resulting offspring are
v2 = (00000|00000111111000101),
v3 = (11100|01110000000010000),
These offspring evaluate to
eval(v2) = f(−0.998113) = 0.940865,
Note that the second offspring has a better evaluation than both of its parents.
Mutation Operator
Mutation alters one or more genes (positions in a chromosome) with a probability equal to the
mutation rate. Assume that the fifth gene from the v3 chromosome was selected for a mutation.
Since the fifth gene in this chromosome is 0, it would be flipped into 1. So the chromosome v3
after this mutation would be
v3 = (1110100000111111000101).
The chromosome represents the value x3 = 1.721638 and f(x3) = −0.082257. This means that
this particular mutation resulted in a significant decrease of the value of the chromosome v3.
On the other hand, if the 10th gene was selected for mutation in the chromosome v3, then
v3 = (1110000001111111000101).
The corresponding value x3 = 1.630818 and f(x3) = 2.343555, an improvement over the
original value of f(x3) = 2.250650.
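Single-point crossover and bit-flip mutation on binary strings can be sketched as below; the sketch reproduces the offspring and the mutated chromosome shown above. In the GA itself the crossover point and the mutated position are drawn at random, whereas here they are passed in explicitly.

def crossover(parent1, parent2, point):
    # Single-point crossover: swap the tails after the chosen gene.
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])

def mutate(chromosome, position):
    # Flip the gene (bit) at the given position.
    flipped = "1" if chromosome[position] == "0" else "0"
    return chromosome[:position] + flipped + chromosome[position + 1:]

v2 = "0000001110000000010000"
v3 = "1110000000111111000101"
o2, o3 = crossover(v2, v3, 5)            # crossover point after the 5th gene
print(o2)                                 # 0000000000111111000101
print(o3)                                 # 1110001110000000010000
print(mutate(v3, 4))                      # 5th gene flipped: 1110100000111111000101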
7.2.5 Parameters
For this particular problem we have used the following parameters: population size pop_size
= 50, probability of crossover pc = 0.25, probability of mutation pm=0.01. The following section
presents some experimental results for such a genetic system.
In Table 7.1 we provide the generation number for which we noted an improvement in the
evaluation function, together with the value of the function. The best chromosome after 150
generations was
vmax=(1111001101000100000101),
which corresponds to a value xmax = 1.850773. Finally, we obtained xmax = 1.850773 and f(xmax)
= 2.850227 by using GA.
Table 7.1 Experimental results of 150 generations for the function f(x) = x·sin(10πx) + 1.0 using the genetic algorithm
Generation number        Evaluation function
1 1.441942
6 2.250003
8 2.250283
10 2.250363
12 2.328077
39 2.344251
40 2.345087
51 2.738930
99 2.849246
137 2.850217
145 2.850227
In this section, we discuss the actions of a genetic algorithm for a simple parameter
optimisation problem. We start with a few general comments; a detailed example follows.
Let us note first that, without any loss of generality, we can assume maximisation problems
only. If the optimisation problem is to minimise a function f, this is equivalent to maximising
a function g where g = −f , i.e.,
Moreover, we may assume that the objective function f takes positive values on its domain;
otherwise we can add some positive constant C, i.e.,
It is clear that to achieve such precision each domain Di should be cut into (bi − ai)×106 equal
size ranges. Let us denote by mi the smallest integer such that (bi − ai)×106 2𝑚𝑖 −1. Then, a
representation having each variable xi coded as a binary string of length mi clearly satisfies
the precision requirement. Additionally, the follow formula interprets each such string:
𝑏 −𝑎
xi = ai + decimal(1001…0012) × 2𝑖𝑚𝑖 −1𝑖 , (7.4)
Now, each chromosome (as a potential solution) is represented by a binary string of length
m = Σ_{i=1}^{k} mi; the first m1 bits map into a value from the range [a1, b1], the next group of m2
bits maps into a value from the range [a2, b2], and so on; the last group of mk bits maps into a value
from the range [ak, bk].
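A short sketch of this encoding/decoding scheme (Python, for illustration only, assuming the six-decimal-place precision discussed above; the function names are ours):

    def required_bits(a: float, b: float, decimals: int = 6) -> int:
        """Smallest m such that (b - a) * 10**decimals <= 2**m - 1."""
        m = 1
        while (b - a) * 10**decimals > 2**m - 1:
            m += 1
        return m

    def decode_chromosome(chrom: str, domains):
        """Split a concatenated bit string and map each slice onto its domain via (7.4)."""
        values, start = [], 0
        for (a, b) in domains:
            m = required_bits(a, b)
            piece = chrom[start:start + m]
            values.append(a + int(piece, 2) * (b - a) / (2**m - 1))
            start += m
        return values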
The rest of the algorithm is straightforward: in each generation we evaluate each chromosome
(using the function f on the decoded sequences of variables), select a new population with
respect to the probability distribution based on fitness values, and alter the chromosomes in
the new population by the crossover and mutation operators. After some number of generations,
when no further improvement is observed, the best chromosome represents a (possibly globally)
optimal solution. Often we stop the algorithm after a fixed number of iterations, depending on
speed and resource criteria.
For the selection process (selection of a new population with respect to the probability
distribution based on fitness values), a roulette wheel with slots sized according to fitness is
used. We construct such a roulette wheel as follows:
The selection process is based on spinning the roulette wheel pop_size times; each time we
select a single chromosome for a new population in the following way:
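The step-by-step construction is omitted above, but the q_i values used in the worked example below imply the usual scheme: cumulative probabilities q_i are formed from the fitness proportions eval(v_i)/F, and each spin selects the first chromosome whose q_i is not smaller than the random number r. A minimal sketch (the function names are ours):

    import random

    def roulette_select(population, eval_fn, spins=None):
        """Spin a roulette wheel whose slot sizes are proportional to fitness."""
        fits = [eval_fn(v) for v in population]
        total = sum(fits)                                 # total fitness F, as in (7.5)
        q, cum = [], 0.0
        for fit in fits:                                  # cumulative probabilities q_i
            cum += fit / total
            q.append(cum)
        new_pop = []
        for _ in range(spins or len(population)):
            r = random.random()                           # spin the wheel
            i = next((k for k, qk in enumerate(q) if qk >= r), len(q) - 1)
            new_pop.append(population[i])
        return new_pop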
Now we are ready to apply the recombination operator, crossover, to the individuals in the
new population. As mentioned earlier, one of the parameters of a genetic algorithm is
probability of crossover pc. This probability gives us the expected number pc×pop_size of
chromosomes which undergo the crossover operation. We proceed in the following way:
Now we mate selected chromosomes randomly: for each pair of coupled chromosomes we
generate a random integer number pos from the range [1..m−1] (m is the total length – number
of bits – in a chromosome). The number pos indicates the position of the crossing point: the two
coupled chromosomes are cut after the pos-th bit and their tails are exchanged, producing a pair of
offspring.
The next operator, mutation, is performed on a bit-by-bit basis. Another parameter of the
genetic algorithm, the probability of mutation pm, gives us the expected number of mutated bits
pm×m×pop_size. Every bit (in all chromosomes in the whole population) has an equal chance
to undergo mutation, i.e., change from 0 to 1 or vice versa. So we proceed in the following
way.
For each chromosome in the current (i.e., after crossover) population and for each bit within
the chromosome, we generate a random number r from the range [0..1]; if r < pm, we mutate the bit.
Example 7.1
Consider the problem of maximising the function
f(x1, x2) = 21.5 + x1·sin(4πx1) + x2·sin(20πx2),
where −3.0 ≤ x1 ≤ 12.1 and 4.1 ≤ x2 ≤ 5.8. The graph of the function f is given in Figure 7.3.
Figure 7.3 Graph of the function f(x1, x2) = 21.5 + x1·sin(4πx1) + x2·sin(20πx2)
Let us assume further that the required precision is four decimal places for each variable. The
domain of variable x1 has length 15.1; the precision requirement implies that the range [−3.0,
12.1] should be divided into at least 15.1×10000 equal size ranges. This means that 18 bits
are required for the first part of the chromosome:
2^17 < 151000 ≤ 2^18.
The domain of variable x2 has length 1.7; the precision requirement implies that the range
[4.1, 5.8] should be divided into at least 1.7×10000 equal size ranges. This means that 15 bits
are required for the second part of the chromosome:
2^14 < 17000 ≤ 2^15.
The total length of a chromosome (solution vector) is then m = 18+15 = 33 bits; the first 18
bits code x1 and remaining 15 bits code x2.
Consider, for example, the chromosome
(010001001011010000111110010100010).
The first 18 bits,
010001001011010000,
represent x1 = −3.0 + decimal(010001001011010000_2) × (12.1 − (−3.0))/(2^18 − 1)
           = −3.0 + 70352 × 15.1/262143 = 1.052426.
The next 15 bits,
111110010100010,
represent x2 = 4.1 + decimal(111110010100010_2) × (5.8 − 4.1)/(2^15 − 1)
           = 4.1 + 31906 × 1.7/32767 = 5.755330.
So the chromosome corresponds to ⟨x1, x2⟩ = ⟨1.052426, 5.755330⟩.
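These two conversions can be verified mechanically; the following snippet (illustrative only) reproduces the arithmetic:

    x1 = -3.0 + int("010001001011010000", 2) * (12.1 - (-3.0)) / (2**18 - 1)
    x2 =  4.1 + int("111110010100010", 2)   * (5.8 - 4.1)      / (2**15 - 1)
    print(round(x1, 6), round(x2, 6))   # ~1.052426 and ~5.755330, matching the values above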
Assume that after the initialisation process we get the following population:
v1 = (100110100000001111111010011011111)
v2 = (111000100100110111001010100011010)
v3 = (000010000011001000001010111011101)
v4 = (100011000101101001111000001110010)
v5 = (000111011001010011010111111000101)
v6 = (000101000010010101001010111111011)
v7 = (001000100000110101111011011111011)
v8 = (100001100001110100010110101100111)
v9 = (010000000101100010110000001111100)
v10= (000001111000110000011010000111011)
v11= (011001111110110101100001101111000)
v12= (110100010111101101000101010000000)
v13= (111011111010001000110000001000110)
v14= (010010011000001010100111100101001)
v15= (111011101101110000100011111011110)
v16= (110011110000011111100001101001011)
v17= (011010111111001111010001101111101)
v18= (011101000000001110100111110101101)
v19= (000101010011111111110000110001100)
v20= (101110010110011110011000101111110)
During the evaluation phase we decode each chromosome and calculate the fitness function
values from (x1, x2) values just decoded. We get:
It is clear that the chromosome v15 is the strongest one and the chromosome v2 is the weakest.
Now the system constructs a roulette wheel for the selection process. The total fitness of the
population (refer to equation (7.5)) is
F = Σ_{i=1}^{20} eval(v_i) = 387.776822.
Now we are ready to spin the roulette wheel 20 times; each time we select a single
chromosome for a new population. Let us assume that a (random) sequence of 20 numbers
from the range [0..1] is:
The first number r = 0.513870 is greater than q10 and smaller than q11, meaning that the
chromosome v11 is selected for the new population; the second number r = 0.175741 is greater
than q3 and smaller than q4, meaning that the chromosome v4 is selected for the new population,
etc.
Now we are ready to apply the recombination operator, crossover, to the individuals in the
new population (vectors vi). The probability of crossover pc=0.25, so we expect that (on
average) 25% of chromosomes (i.e., 5 out of 20) undergo crossover. We proceed in the
following way: for each chromosome in the (new) population we generate a random number
r from the range [0..1]; if r < 0.25, we select the given chromosome for crossover. Assume that the sequence of random numbers generated is:
This means that the chromosomes v2, v11, v13, v18 were selected for crossover. (We were
lucky: the number of selected chromosomes is even, so we can pair them easily. If the number
of selected chromosomes were odd, we would either add one extra chromosome or remove
one selected chromosome – this choice is made randomly as well.) Now we mate selected
chromosomes randomly: say, the first two (i.e., v2 and v11) and the next two (i.e., v13 and
v18) are coupled together. For each of these two pairs, we generate a random integer number
pos from the range [1..32] (33 is the total length – number of bits – in a chromosome). The
number pos indicates the position of the crossing point. The first pair of chromosome is
v2 = (100011000|101101001111000001110010)
v11= (111011101|101110000100011111011110)
and the generated number pos = 9. These chromosomes are cut after the 9th bit and replaced
by a pair of their offspring:
v2 = (100011000|101110000100011111011110)
v11= (111011101|101101001111000001110010)
The second pair of chromosomes is
v13 = (00010100001001010100|1010111111011)
v18 = (11101111101000100011|0000001000110)
and the generated number pos = 20. These chromosomes are cut after the 20th bit and replaced
by a pair of their offspring:
v13 = (00010100001001010100|0000001000110)
v18= (11101111101000100011|1010111111011)
After the crossover operation, the current version of the population is:
v1 = (011001111110110101100001101111000)
v2 = (100011000101110000100011111011110)
v3 = (001000100000110101111011011111011)
v4= (011001111110110101100001101111000)
v5= (000101010011111111110000110001100)
v6 = (100011000101101001111000001110010)
v7= (111011101101110000100011111011110)
v8 = (000111011001010011010111111000101)
v9= (011001111110110101100001101111000)
v10 = (000010000011001000001010111011101)
v11= (111011101101101001111000001110010)
v12 = (010000000101100010110000001111100)
v13 = (000101000010010101001010111111011)
v14 = (100001100001110100010110101100111)
v15= (101110010110011110011000101111110)
v16 = (100110100000001111111010011011111)
v17= (000001111000110000011010000111011)
v18= (111011111010001000111010111111011)
v19= (111011101101110000100011111011110)
v20= (110011110000011111100001101001011)
The next operator, mutation, is performed on a bit-by-bit basis. The probability of mutation
pm = 0.01, so we expect that (on average) 1% of bits would undergo mutation. There are m ×
pop_size = 33×20 = 660 bits in the whole population; we expect (on average) 6.6 mutations
per generation. Every bit has an equal chance to be mutated, so, for every bit in the population,
we generate a random number r from the range [0..1]; if r < 0.01, we mutate the bit.
This means that we have to generate 660 random numbers. In a sample run, 5 of these numbers
were smaller than 0.01; the bit number and the random number are listed below:
The following table translates the bit position into chromosome number and the bit number
within the chromosome:
This means that four chromosomes are affected by the mutation operator; one of the
chromosomes (the 13th) has two bits changed.
The final population is listed below; the mutated bits are typed in boldface. We drop primes
for modified chromosomes: the population is listed as new vectors vi:
v1 = (011001111110110101100001101111000)
v2 = (100011000101110000100011111011110)
v3 = (001000100000110101111011011111011)
v4= (011001111110110101100001101111000)
v5= (000101010011111111110000110001100)
v6 = (100011000101101001111000001110010)
v7= (111011101101110000100011111011110)
v8 = (000111011001010011010111111000101)
v9= (011001111110110101100001101111000)
v10 = (000010000011001000001010111011101)
v11= (111011101101101001111000001110010)
v12 = (010000000101100010110000001111100)
v13 = (000101000010010101001010111111011)
v14 = (100001100001110100010110101100111)
v15= (101110010110011110011000101111110)
v16 = (100110100000001111111010011011111)
v17= (000001111000110000011010000111011)
v18= (111011111010001000111010111111011)
v19= (111011101101110000100011111011110)
v20= (110011110000011111100001101001011)
We have just completed one iteration (i.e., one generation) of the while loop in the genetic
procedure (Figure 7.1). It is interesting to examine the results of the evaluation process of the
new population. During the evaluation phase we decode each chromosome and calculate the
fitness function values from (x1, x2) values just decoded.
We get:
eval(v1) = f (3.130078, 4.996097) = 23.410669
eval(v2) = f (5.279042, 5.054515) = 18.201083
eval(v3) = f (−0.991471, 5.680258) = 16.020812
eval(v4) = f (3.128235, 4.996097) = 23.412613
eval(v5) = f (−1.746635, 5.395584) = 20.095903
eval(v6) = f (5.278638, 5.593460) = 17.406725
eval(v7) = f (11.089025, 5.054515) = 30.060205
eval(v8) = f (−1.255173, 4.734458) = 25.341160
eval(v9) = f (3.130078, 4.996097) = 23.410669
eval(v10)= f (−2.516603, 4.390380) = 19.526329
eval(v11)= f (11.088621, 4.743434) = 33.351874
eval(v12)= f (0.785406, 5.381472) = 16.127799
eval(v13)= f (−1.811725, 4.209937) = 22.692462
eval(v14)= f (4.910618, 4.703018) = 17.959701
eval(v15)= f (7.935998, 4.757338) = 13.666916
eval(v16)= f (6.084492, 5.652242) = 26.019600
eval(v17)= f (−2.554851, 4.793707) = 21.278435
eval(v18)= f (11.134646, 5.666976) = 27.591064
eval(v19)= f (11.059532, 5.054515) = 27.608441
eval(v20)= f (9.211598, 4.993762) = 23.867227
The total fitness of the new population, F = 447.049688, is much higher than the total fitness of the
previous population, 387.776822. Also, the best chromosome now (v11) has a better evaluation
(33.351874) than the best chromosome (v15) from the previous population (30.060205).
Now we are ready to run the selection process again and apply the genetic operators, evaluate
the next generation, etc. After 1000 generations the population is:
v1 = (111011110110011011100101010111011)
v2 = (111001100110000100010101010111000)
v3 = (111011110111011011100101010111011)
v4= (111001100010000110000101010111001)
v5= (111011110111011011100101010111011)
v6 = (111001100110000100000100010100001)
v7= (110101100010010010001100010110000)
v8 = (111101100010001010001101010010001)
v9= (111001100010010010001100010110001)
v10 = (111011110111011011100101010111011)
v11= (110101100000010010001100010110000)
v12 = (110101100010010010001100010110001)
v13 = (111011110111011011100101010111011)
v14 = (111001100110000100000101010111011)
v15= (111001101010111001010100110110001)
v16 = (111001100110000101000100010100001)
v17= (111001100110000100000101010111011)
v18= (111001100110000100000101010111001)
v19= (111101100010001010001110000010001)
v20= (111001100110000100000101010111001)
However, if we look carefully at the progress during the run, we may discover that in earlier
generations the fitness values of some chromosomes were better than the value 35.477938 of
the best chromosome after 1000 generations. For example, the best chromosome in generation
396 had a value of 38.827553. It is relatively easy to keep track of the best individual during the
evolution process. It is customary to store the "best ever" individual at a separate location; in
that way, the algorithm reports the best value found during the whole process (as
opposed to the best value in the final population).
In Sections 7.1-7.2, the binary-coded genetic algorithm (BCGA) was discussed. The BCGA
has some drawbacks when applied to multidimensional, high-precision numerical problems.
For example, if 100 variables in the range [−500, 500] are involved, and a precision of 6 digits
after the decimal point is required, the length of the binary solution vector is 3000. This, in
turn, generates a search space of about 2^3000 points. The performance of the BCGA will then
be poor. The situation can be improved if a GA that works directly with real (floating-point) numbers is used.
Each chromosome is coded as a vector of floating-point numbers of the same length as the
solution vector. We call this a real-coded genetic algorithm (RCGA). A large domain can
thus be handled (e.g. the parameter space of a neural network).
Procedure RCGA
begin
    initialise and evaluate the population P
    while the termination condition is not met do: select P, alter P by crossover and mutation, and evaluate P
end
The process of the RCGA is the same as that of the BCGA, as shown in Figure 7.4. Note that, to
distinguish the chromosomes and genes of the two algorithms, we use P to represent the population
of the RCGA (instead of V for the BCGA) and p to represent a chromosome of the RCGA (instead of
v for the BCGA). A population of chromosomes P is first initialised, where P =
[p1  p2  ⋯  p_pop_size] and pop_size is the number of chromosomes in the population.
p_i = [p_i^1  p_i^2  ⋯  p_i^no_vars],  i = 1, 2, …, pop_size,    (7.8)
para_min^j ≤ p_i^j ≤ para_max^j,    (7.9)
p_max = [para_max^1  para_max^2  ⋯  para_max^no_vars],    (7.10)
p_min = [para_min^1  para_min^2  ⋯  para_min^no_vars],    (7.11)
where no_vars denotes the number of variables (genes); para_min^j and para_max^j are the
minimum and maximum values of p_i^j respectively for all j.
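A hedged sketch of the corresponding initialisation step, with each gene drawn uniformly from [para_min^j, para_max^j] (the function name and the bounds used in the usage line are illustrative assumptions):

    import random

    def init_population(pop_size: int, para_min, para_max):
        """Create pop_size real-coded chromosomes; gene j lies in [para_min[j], para_max[j]]."""
        no_vars = len(para_min)
        return [[random.uniform(para_min[j], para_max[j]) for j in range(no_vars)]
                for _ in range(pop_size)]

    # Example: 50 chromosomes with 3 genes each, bounded by [-500, 500]
    P = init_population(50, para_min=[-500.0] * 3, para_max=[500.0] * 3)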
Different genetic operations have been proposed to improve the efficiency of the RCGA.
Genetic operations usually refer to crossover and mutation.
For the crossover operation, the single-point crossover (SPX), the arithmetic crossover and the
blend crossover (BLX-α) have been developed.
The single-point crossover exchanges information between two selected chromosomes (p1 and p2),
where
p1 = [p_1^1  p_1^2  p_1^3  ⋯  p_1^no_vars]    (7.12)
p2 = [p_2^1  p_2^2  p_2^3  ⋯  p_2^no_vars]    (7.13)
It generates a random integer number r from a uniform distribution between 1 and no_vars, and
creates new offspring p1′ and p2′ as follows.
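The offspring equations themselves are not reproduced above; the usual construction, consistent with the binary single-point crossover earlier in this chapter, exchanges the gene tails after position r. A minimal sketch under that assumption:

    import random

    def spx(p1, p2):
        """Single-point crossover for real-coded chromosomes: swap the tails after a random gene."""
        r = random.randint(1, len(p1) - 1)     # crossover point (assumes at least two genes)
        return p1[:r] + p2[r:], p2[:r] + p1[r:]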
Arithmetic crossover is defined as a linear combination of two selected chromosomes (p1 and
p2). The resulting offspring p1′ and p2′ are defined accordingly,
where d_i = |p_1^i − p_2^i|, p_1^i and p_2^i are the i-th elements of p1 and p2, respectively, and α is a
positive constant.
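Since the offspring formulas are not shown above, the sketch below uses the standard textbook definitions of arithmetic crossover (a convex combination with weight w) and BLX-α (each offspring gene drawn from the parents' interval widened by α·d_i); treat both as assumptions rather than the exact forms intended here.

    import random

    def arithmetic_crossover(p1, p2, w=0.5):
        """Standard arithmetic crossover: offspring are convex combinations of the parents."""
        c1 = [w * a + (1 - w) * b for a, b in zip(p1, p2)]
        c2 = [(1 - w) * a + w * b for a, b in zip(p1, p2)]
        return c1, c2

    def blx_alpha(p1, p2, alpha=0.5):
        """BLX-alpha: each offspring gene is drawn from the parents' interval widened by alpha*d_i."""
        child = []
        for a, b in zip(p1, p2):
            d = abs(a - b)
            lo, hi = min(a, b) - alpha * d, max(a, b) + alpha * d
            child.append(random.uniform(lo, hi))
        return child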
For the mutation operation, the uniform mutation (random mutation), the boundary mutation
and the non-uniform mutation are commonly used.
For the uniform (random) mutation, if the gene p_1^j is selected for mutation, it is replaced by
p′_1^j = U(para_min^j, para_max^j),    (7.20)
where U(para_min^j, para_max^j) is a random number drawn uniformly between the lower and upper bounds.
For the non-uniform mutation, if the gene p_1^j is selected for mutation (the value of the mutated
gene p′_1^j stays inside [para_min^j, para_max^j]), the mutated gene is given by
p′_1^j = p_1^j + Δ(t, para_max^j − p_1^j)   if rd = 0,
p′_1^j = p_1^j − Δ(t, p_1^j − para_min^j)   if rd = 1,    (7.21)
where rd is a random number equal to 0 or 1 only. The function Δ(t, y) returns a value in the
range [0, y] such that Δ(t, y) approaches 0 as t increases. It is defined as follows:
Δ(t, y) = y(1 − r^((1 − t/T)^ζ_num)),    (7.22)
where r is a random number in the range [0, 1], t is the present generation number of the
population, T is the maximum generation number of the population, and ζ_num is a system
parameter that determines the degree of non-uniformity.
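A direct transcription of (7.21)-(7.22) into code; the choice ζ_num = 2 and the function names are illustrative assumptions.

    import random

    def delta(t, y, T, zeta_num):
        """Equation (7.22): a value in [0, y] that shrinks towards 0 as t approaches T."""
        r = random.random()
        return y * (1 - r ** ((1 - t / T) ** zeta_num))

    def nonuniform_mutate(p, j, t, T, para_min, para_max, zeta_num=2.0):
        """Apply the non-uniform mutation of (7.21) to gene j of chromosome p."""
        mutated = list(p)
        if random.randint(0, 1) == 0:                        # rd = 0
            mutated[j] = p[j] + delta(t, para_max[j] - p[j], T, zeta_num)
        else:                                                # rd = 1
            mutated[j] = p[j] - delta(t, p[j] - para_min[j], T, zeta_num)
        return mutated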
One of the important issues in neural networks is the learning or training of the network.
The learning process aims to find a set of optimal network parameters. One major weakness
of gradient-based methods is that derivative information is required, so the error
function to be minimised has to be continuous and differentiable. Also, the learning process
is easily trapped in a local optimum, especially when the problem is multimodal, and the
learning rules are network-structure dependent. To tackle these problems, global-search
evolutionary algorithms (EAs), such as the genetic algorithm (GA), can be employed to
search a large, complex, non-differentiable and multimodal domain.
For the single-layer neural network shown in Figure 7.5, the output of the network is
governed by
z = Γ(Wy),    (7.23)
where y denotes the input vector, z the output vector, d the desired output vector, Γ(·) the
non-linear operator, and W the K×J weight matrix:
y = [y1  y2  ⋯  yJ]ᵀ,  z = [z1  z2  ⋯  zK]ᵀ,  d = [d1  d2  ⋯  dK]ᵀ,
W = [ w11 ⋯ w1J
       ⋮   ⋱   ⋮
      wK1 ⋯ wKJ ].
The Genetic Algorithm (GA) is used to optimise W so as to minimise the mean square error of
the application:
fitness = 1/(1 + err),    (7.24)
err = (1/K) Σ_{i=1}^{K} (d_i − z_i)^2,    (7.25)
The objective is to maximize the fitness value of (7.24) using the GA by setting the
chromosome to be [𝑤11 … 𝑤𝑘𝑗 … 𝑤𝐾𝐽 ] for all j, k. It can be seen from (7.24) and (7.25)
that a larger fitness value implies a smaller error value.
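A brief sketch of how the fitness of (7.24)-(7.25) might be evaluated for a candidate weight chromosome; the tanh activation standing in for Γ(·) and the random test data are assumptions for illustration only.

    import numpy as np

    def fitness(chromosome, y, d, K, J):
        """Decode a chromosome into the K x J weight matrix W and return the fitness of (7.24)."""
        W = np.asarray(chromosome).reshape(K, J)
        z = np.tanh(W @ y)                      # tanh stands in for the non-linear operator Gamma
        err = np.mean((d - z) ** 2)             # mean square error of (7.25)
        return 1.0 / (1.0 + err)

    # Illustrative usage with random data (J = 3 inputs, K = 2 outputs):
    rng = np.random.default_rng(0)
    y, d = rng.normal(size=3), rng.normal(size=2)
    chrom = rng.normal(size=2 * 3)              # chromosome = [w11 ... wKJ]
    print(fitness(chrom, y, d, K=2, J=3))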
References
Ling, S.H. 2010, Genetic Algorithm and Variable Neural Networks: Theory and Application,
Lambert Academic Publishing.