Hybrid Methods Using Evolutionary Algorithms
for On–line Training
G.D. Magoulas (1,3), V.P. Plagianakos (2,3), and M.N. Vrahatis (2,3)
(1) Department of Information Systems and Computing, Brunel University,
Uxbridge UB8 3PH, UK. e–mail: George.Magoulas@brunel.ac.uk
(2) Department of Mathematics, University of Patras,
GR–261.10 Patras, Greece. e–mail: {vpp,vrahatis}@math.upatras.gr
(3) University of Patras Artificial Intelligence Research Center–UPAIRC.
Abstract
A novel hybrid evolutionary approach is presented in
this paper for improving the performance of neural network classifiers in slowly varying environments. For
this purpose, we investigate a coupling of Differential
Evolution Strategy and Stochastic Gradient Descent,
using both the global search capabilities of Evolutionary Strategies and the effectiveness of on–line gradient descent. The use of Differential Evolution Strategy
is related to the concept of evolution of a number of
individuals from generation to generation and that of
on–line gradient descent to the concept of adaptation
to the environment by learning. The hybrid algorithm
is tested in two real-life image processing applications.
Experimental results suggest that the hybrid strategy is
capable of effective on–line training, leading to networks
with increased generalization capability.
1 Introduction
Learning in Artificial Neural Networks (ANNs) is usually achieved by minimizing the network’s error, which
is a measure of its performance, and is defined as the
difference between the actual output vector of the network and the desired one. This approach is very popular for training ANNs and includes training algorithms
that can be divided into two categories: batch, also called
off–line, and stochastic, also called on–line.
The batch training of ANNs is considered the classical machine learning approach: a set of examples is
used for learning a good approximating function, i.e.,
for training the ANN, before the network is used in the application. Batch training is consistent with the theory of
unconstrained optimization and can be viewed as the
minimization of the function E; that is, to find a set of
weights $w^* = (w_1^*, w_2^*, \ldots, w_n^*) \in \mathbb{R}^n$, such that:
$$w^* = \min_{w \in \mathbb{R}^n} E(w), \qquad (1)$$
where E is the batch error measure defined as the sum–
of–squared–differences error function over the entire
training set.
The rapid computation of such a minimizer is a rather
difficult task since, in general, the dimension of parameter space is high and the error function generates
a complicated surface in this space, possessing multitudes of local minima and having broad flat regions
adjoined to narrow steep ones that need to be searched
to locate an “optimal” weight set.
On the other hand, in on–line training, the function
E is pattern-based and is defined as the instantaneous
squared–differences error function with respect to the
currently presented training pattern. In this case, the
ANN weights are updated after the presentation of each
training example, which may be sampled with or without repetition. On–line training may be the appropriate choice for learning a task, either because of the very
large (or even redundant) training set, or because of
the slowly time–varying nature of the task. Moreover,
it helps escape local minima and provides a more natural approach to learning time–varying functions and
continuously adapting to a changing environment. As Sutton pointed out [24], “on–line learning is essential if we
want to obtain learning systems as opposed to merely
learned ones”.
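To make the distinction between the two error measures concrete, the following minimal sketch contrasts them for a toy linear model. The model `predict`, the sample data and all function names are illustrative assumptions; the paper itself works with multilayer ANNs.

```python
from typing import List, Sequence, Tuple

Pattern = Tuple[Sequence[float], float]  # (input vector, desired output)


def predict(w: Sequence[float], x: Sequence[float]) -> float:
    """Toy linear model standing in for an ANN (assumption of this sketch)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def pattern_error(w: Sequence[float], pattern: Pattern) -> float:
    """Instantaneous squared-difference error for one pattern (on-line)."""
    x, target = pattern
    return (predict(w, x) - target) ** 2


def batch_error(w: Sequence[float], patterns: List[Pattern]) -> float:
    """Sum of squared differences over the entire training set (batch)."""
    return sum(pattern_error(w, p) for p in patterns)


patterns: List[Pattern] = [((1.0, 0.0), 1.0), ((0.0, 1.0), 2.0), ((1.0, 1.0), 2.0)]
w = [1.0, 1.0]
print(pattern_error(w, patterns[1]))  # error on the currently presented pattern
print(batch_error(w, patterns))       # error accumulated over the whole set
```

In on–line training only `pattern_error` is available at each step, so the surface being minimized changes with every presented pattern.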
In practice, on–line methods seem to be more robust
than batch methods as errors, omissions or redundant
data in the training set can be corrected or ejected
during the training phase. Additionally, training data
can often be generated easily and in great quantities
when the system is in operation, whereas they are usually scarce and precious before. Furthermore, on–line
training, and/or on–line retraining, of ANNs is very important in many real–time reactive environments.
Examples include controlling the steering direction of an autonomous vehicle system under various road
conditions [5], and recognizing, detecting and extracting objects
in images and video sequences under variable perceptual conditions (shading, shadows, lighting conditions,
and reflections) [4, 7, 13, 14, 27].
Despite the abundance of methods for learning from examples, only a few can be used effectively
for on–line learning. For example, the classic batch
training algorithms cannot straightforwardly handle
nonstationary data. Even when some of them are used
in on–line training there exists the problem of “catastrophic interference”, in which training on new examples interferes excessively with previously learned examples leading to saturation and slow convergence [25].
Note that in this context it is not possible to use advanced optimization methods, such as conjugate gradient, variable metric, simulated annealing etc., as these
methods rely on a fixed error surface [19]. Consequently, given the inherent efficiency of stochastic gradient descent, various schemes have been recently proposed based on this idea, [2, 19, 21, 22, 24]. However,
these schemes suffer from several drawbacks, such as
sensitivity to learning parameters [19].
This paper proposes a new Hybrid Evolutionary Algorithm (HEA) for on–line training. The HEA can
conceptually be split into two stages. In the first
stage, on–line training is adopted using a recently proposed stochastic gradient descent with adaptive stepsize [11]. In the second stage, a Differential Evolution
(DE) strategy [23] is used for on–line retraining. The
use of the DE strategy is based on the assumption that
the first stage has produced a “good” solution that can
be incorporated directly into the genes and inherited
by offspring.
The rest of the paper is organized as follows: the use of
Evolutionary Algorithms in ANNs training is discussed
in Section 2. In Section 3, the new hybrid on–line training algorithm is introduced. Section 4 presents details
on the application of the hybrid method in training
on–line ANNs for real-life image recognition problems
and outlines the implementation results. Finally, in
Section 5, conclusions and a short discussion of future
work are presented.
2 Evolutionary Algorithms in ANN Training
Evolutionary Algorithms (EAs) are stochastic search
methods that mimic the metaphor of natural biological
evolution. They operate on a population of potential
solutions applying the principle of survival of the fittest
to produce better and better approximations to a solution. At each generation, a new set of approximations
is created by the process of selecting individuals according to their level of fitness in the problem domain and
breeding them together using operators borrowed from
natural genetics [3]. Many attempts have been made
within the artificial intelligence community to integrate
EAs and ANNs. Several of these have concentrated on applying evolutionary principles to improve
the generalization of ANNs and to discover the appropriate
network topology and the best available set of weights
(see [18]).
Most approaches that use evolutionary principles in conjunction with ANN training formulate the problem of finding the weights of a fixed
neural architecture, when the whole set of examples is
available, as an optimization problem. EAs are global
search methods and, thus, less susceptible to local minima [16]. Nevertheless, EAs remain, in certain cases,
more computationally expensive than training by a
variant of the backpropagation method [17]. This is
one of the reasons that narrow the applicability of EAs
to off–line ANN training.
In previous work, we demonstrated the efficiency of a
special class of EAs, called Differential Evolution (DE)
strategies [12, 23], in off–line training [15]. DE strategies can handle non–differentiable, nonlinear and multimodal objective functions efficiently, and require few
easily chosen control parameters. Experimental results have shown that DE strategies have good convergence properties and outperform other evolutionary
algorithms [15].
To apply DE strategies to ANN training we start with a
specific number (NP) of n–dimensional weight vectors,
as an initial weight population, and evolve them over
time; NP is fixed throughout the training process and
the weight population is initialized randomly following
a uniform probability distribution. At each iteration,
called generation, new weight vectors are generated
by the combination of weight vectors randomly chosen
from the population. This operation is called mutation. The resulting weight vectors are then mixed
with another predetermined weight vector – the target
vector – and this operation is called crossover. It
yields the so–called trial vector. The trial
vector is accepted for the next generation if and only
if it reduces the value of the error function E. This
last operation is called selection. The above mentioned
operations introduce diversity in the population and
are used to help the algorithm escape local minima
in the weight space. The combined action of mutation and crossover is responsible for much of the effectiveness of DE’s search, and allows DE strategies to act as
parallel, noise–tolerant hill–climbing algorithms, which
efficiently search the whole weight space.
As in this work we focus on the on–line training and retraining of ANNs, we adopt a formulation of this problem which is based on tracking the changing location
of the minimum of a pattern-based, and, thus, dynamically changing, error function. This approach coincides
with the way adaptation in the evolutionary time scale
is considered [20], and allows us to explore and expand
further research on the tracking performance of evolution strategies and genetic algorithms [1, 20, 26].
3 The Hybrid Evolutionary Algorithm
In this section, we present a Lamarck–inspired combination of Differential Evolution strategy and Stochastic
Gradient Descent (SGD). The DE strategy works on
the termination point of the SGD. Thus, the method
consists of an SGD–based on–line training stage and an
evolutionary strategy–based on–line retraining stage.
A generic description of the proposed hybrid algorithm
is given in Algorithm 1. First, the SGD is outlined in
Stage 1 of Algorithm 1, where η is the stepsize, K
is the meta–stepsize and $\langle \cdot, \cdot \rangle$ stands for the usual inner product in $\mathbb{R}^n$. The memory–based calculation of
the stepsize, in Step 4a, takes into consideration previously computed pieces of information to adapt the
stepsize for the next pattern presentation. This provides some kind of stabilization in the calculated values of the stepsize, and helps the stochastic gradient
descent to exhibit fast convergence and high success
rate. Note that the classification error, an upper limit
to the error function evaluations, or a pattern-based error measure can be used as the termination condition
in Step 5a. The key features of the SGD method are
the low storage requirements and the inexpensive computations. Moreover, in order to calculate the stepsize
to be used at the next iteration, this on–line algorithm
uses information from the current, as well as the previous iteration.
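The Stage 1 update can be sketched in a few lines. The version below is only an illustration under stated assumptions: it uses a toy linear model with a squared–difference pattern error instead of an ANN, a fixed number of epochs as the termination condition, and a zero vector as the "previous gradient" before the first pattern; the names `sgd_adaptive` and `grad_pattern_error` are hypothetical.

```python
from typing import List, Sequence, Tuple

Pattern = Tuple[Sequence[float], float]  # (input vector, desired output)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Usual inner product in R^n."""
    return sum(ai * bi for ai, bi in zip(a, b))


def grad_pattern_error(w: Sequence[float], x: Sequence[float], target: float) -> List[float]:
    """Gradient of (w.x - target)^2 with respect to w (toy linear model)."""
    residual = dot(w, x) - target
    return [2.0 * residual * xi for xi in x]


def sgd_adaptive(
    patterns: Sequence[Pattern],
    w0: Sequence[float],
    eta0: float,
    K: float,
    epochs: int,
) -> Tuple[List[float], float]:
    """On-line gradient descent with the memory-based stepsize adaptation."""
    w = list(w0)
    eta = eta0
    prev_grad = [0.0] * len(w)  # assumption: no gradient memory before the first pattern
    for _ in range(epochs):     # assumption: fixed epoch budget as termination condition
        for x, target in patterns:
            grad = grad_pattern_error(w, x, target)          # Step 2a
            w = [wi - eta * gi for wi, gi in zip(w, grad)]   # Step 3a
            eta = eta + K * dot(prev_grad, grad)             # Step 4a
            prev_grad = grad
    return w, eta


patterns: List[Pattern] = [((1.0, 0.0), 1.0), ((0.0, 1.0), 2.0), ((1.0, 1.0), 3.0)]
w_final, eta_final = sgd_adaptive(patterns, [0.0, 0.0], eta0=0.1, K=0.001, epochs=100)
print(w_final, eta_final)
```

The stepsize grows when successive gradients point in similar directions (positive inner product) and shrinks when they oppose each other, which is the stabilizing behavior described above.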
In Stage 2 of Algorithm 1, the DE strategy, responsible for the on–line retraining, is outlined. Steps 3b and
4b implement the mutation and crossover operators,
respectively, while Step 5b is the selection operator.
The first DE operator used is the mutation operator.
Specifically, for each weight vector $w_i^p$, a new vector
called mutant vector is generated according to the following relation:
$$\text{Mutant Vector} = w_i^p + \xi\,(w_{best} - w_i^p) + \xi\,(w_{r_1} - w_{r_2}),$$
where $w_{best}$ is the best member of the previous generation, ξ > 0 is a real parameter called mutation constant,
which controls the amplification of the difference between
two weight vectors, and $w_{r_1}$ and $w_{r_2}$ are two randomly
chosen weight vectors, different from $w_i^p$.
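A minimal sketch of this mutation rule is given below. The function name `mutate`, the list–based representation of weight vectors, and the choice to draw the two random vectors as distinct from each other (the text only requires them to differ from the current vector) are assumptions of the sketch.

```python
import random
from typing import List, Optional, Sequence


def mutate(
    population: Sequence[Sequence[float]],
    i: int,
    best: Sequence[float],
    xi: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the mutant vector built from population[i]."""
    rng = rng or random.Random()
    # Indices r1, r2 are drawn without replacement from all indices except i.
    candidates = [k for k in range(len(population)) if k != i]
    r1, r2 = rng.sample(candidates, 2)
    w_i, w_r1, w_r2 = population[i], population[r1], population[r2]
    # w_i + xi * (w_best - w_i) + xi * (w_r1 - w_r2), component by component.
    return [
        wi + xi * (wb - wi) + xi * (a - b)
        for wi, wb, a, b in zip(w_i, best, w_r1, w_r2)
    ]


population = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
best = [2.0, 2.0]
print(mutate(population, 0, best, xi=0.5, rng=random.Random(0)))
```

The population needs at least three members for the two random vectors to exist; the first term pulls the vector toward the best member, while the second adds a scaled random difference.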
To increase further the diversity of the mutant weight
vector, the crossover operator is applied. Specifically,
for each component j (j = 1, 2, . . . , n) of the mutant
weight vector, we randomly choose a real number r
from the interval [0, 1]. Then, we compare this number
with ρ > 0 (crossover constant), and if r ≤ ρ we select,
as the j-th component of the trial vector, the corresponding component j of the mutant vector. Otherwise, we pick the j-th component of the target vector.
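The crossover rule, together with the selection test of Step 5b, can be sketched as follows. The function names and the generic `error` callable are illustrative assumptions; in the paper the error is the pattern–based ANN error E.

```python
import random
from typing import Callable, List, Optional, Sequence


def crossover(
    mutant: Sequence[float],
    target: Sequence[float],
    rho: float = 0.7,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Mix mutant and target component-wise to obtain the trial vector."""
    rng = rng or random.Random()
    trial = []
    for m_j, t_j in zip(mutant, target):
        r = rng.random()  # uniform draw; Python samples from [0, 1)
        trial.append(m_j if r <= rho else t_j)
    return trial


def select(
    trial: Sequence[float],
    target: Sequence[float],
    error: Callable[[Sequence[float]], float],
) -> List[float]:
    """Keep the trial vector only if it does not increase the error."""
    return list(trial) if error(trial) <= error(target) else list(target)


def toy_error(w: Sequence[float]) -> float:
    """Hypothetical stand-in for the pattern-based error E."""
    return sum(wi * wi for wi in w)


target = [1.0, 1.0, 1.0]
mutant = [0.5, 2.0, 0.0]
trial = crossover(mutant, target, rho=0.7, rng=random.Random(0))
print(trial, select(trial, target, toy_error))
```

Because a rejected trial vector leaves the target vector unchanged, the error of each population member never increases for the pattern currently presented.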
4 Experiments and Results
We have tested the proposed hybrid algorithm in two
real–life classification tasks. The first experiment concerns training on–line an ANN classifier to discriminate
among 12 texture images, and the second one, training
an ANN to detect suspicious regions in colonoscopic
video sequences. In all cases, no operation for tuning
the mutation and crossover constants was carried out;
default fixed values ξ = 0.5 and ρ = 0.7 have been
used.
4.1 The texture classification problem
A total of 12 Brodatz texture images of size 512 × 512 [6], shown in Figure 1,
were acquired by a scanner at 150 dpi. From each texture image, 10 subimages
of size 128 × 128 were randomly selected, and the co–occurrence method [8]
was applied. In the co–occurrence method, the relative frequencies of
gray–level pairs of pixels at certain relative displacements are computed
and stored in a matrix. The combination of the nearest neighbor pairs at
orientations 0°, 45°, 90° and 135° is used in the experiment. A set of 10
sixteen–dimensional training patterns was extracted from each image.
Figure 1: The twelve texture images.

Stage 1 - “Learning”
Step 0a: Initialize the weights $w^0$, $\eta^0$ and the meta–stepsize K.
Step 1a: Repeat for each pattern p.
Step 2a:   Calculate $E(w^p)$ and then $\nabla E(w^p)$.
Step 3a:   Update the weights:
             $w^{p+1} = w^p - \eta^p \nabla E(w^p)$.
Step 4a:   Calculate the stepsize to be used with the next pattern p + 1:
             $\eta^{p+1} = \eta^p + K \langle \nabla E(w^{p-1}), \nabla E(w^p) \rangle$.
Step 5a: Until the termination condition is met.
Step 6a: Return the final weights $w^{p+1}$ to Stage 2.

Stage 2 - “Evolution”
Step 0b: Initialize the DE population in the neighborhood of $w^{p+1}$.
Step 1b: Repeat for each input pattern p.
Step 2b:   For i = 1 to NP
Step 3b:     MUTATION($w_i^p$) → Mutant Vector.
Step 4b:     CROSSOVER(Mutant Vector) → Trial Vector.
Step 5b:     If E(Trial Vector) ≤ E($w_i^p$), accept Trial Vector for the next generation.
Step 6b:   EndFor
Step 7b: Until the termination condition is met.

Algorithm 1: Generic Model of the Hybrid On–line Training Algorithm
A 16–8–12 ANN (224 weights, 20 biases) was trained
on–line to classify the patterns into the 12 texture
types. The network used neurons with logistic activations
and biases, and the weights and biases were initialized
with random numbers from the interval (−1, 1). The
termination condition for the first stage was a classification error CE ≤ 3%. Then, the second stage was
executed for on–line retraining using new patterns from
the training set. At the end, the generalization capability of the trained network was tested on the test
set, which consisted of 320 patterns (20 patterns from
each image extracted from randomly selected subimages). The ANN correctly classified 304 out of 320
patterns. Thus, it exhibited 95% generalization success. In the same task, the performance of the SGD
alone, i.e. without using the evolution stage of the algorithm for the on–line retraining, was 93%, while the
performance of the batch backpropagation algorithm
with variable stepsize, [10], was 90%.
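As a quick check of the network size quoted for this experiment, the parameter counts can be reproduced with a few lines. The sketch assumes a fully connected feedforward network with one bias per hidden and output neuron; `count_parameters` is a hypothetical helper name.

```python
from typing import Sequence, Tuple


def count_parameters(layers: Sequence[int]) -> Tuple[int, int]:
    """Return (weights, biases) for a fully connected feedforward network."""
    # One weight per connection between consecutive layers.
    weights = sum(a * b for a, b in zip(layers, layers[1:]))
    # One bias per neuron outside the input layer.
    biases = sum(layers[1:])
    return weights, biases


print(count_parameters([16, 8, 12]))  # → (224, 20)
```

The total of 244 adjustable parameters is the dimension n of the weight space searched by both stages of the algorithm in this task.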
4.2 Abnormalities detection by colonoscopy
Colonoscopy is a minimally invasive technique for the
production of medical images. A narrow pipe–like
structure, an endoscope, is passed into the patient’s
body. Video endoscopes have small cameras in their
tips. When passed into a body, what the camera observes is displayed on a television monitor (see Figure 2
for a sample frame of the video sequence). The physician controls the endoscope’s direction using wheels
and buttons. An important stage of the implementation is the feature extraction process [9]. In our experiments we have used the co–occurrence matrices to
generate features. More specifically, each frame of the
endoscopic video sequence was separated into windows
of size 16 pixels by 16 pixels. Then, the co–occurrence
matrices algorithm was used to gather information regarding each pixel in an image window, and to generate
feature vectors that contain sixteen elements each.
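The core of the co–occurrence computation for one image window can be sketched as below. This is a simplified illustration, not the feature extractor used in the experiments: the number of gray levels, the non–symmetric counting, and the mapping of orientations to pixel displacements are assumptions, and the statistics that turn the matrices into the sixteen features are not reproduced here.

```python
from typing import List, Sequence


def cooccurrence(
    window: Sequence[Sequence[int]], dx: int, dy: int, levels: int
) -> List[List[float]]:
    """Relative frequencies of gray-level pairs at displacement (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(window), len(window[0])
    total = 0
    for y in range(rows):
        for x in range(cols):
            y2, x2 = y + dy, x + dx
            # Only count pairs whose second pixel lies inside the window.
            if 0 <= y2 < rows and 0 <= x2 < cols:
                counts[window[y][x]][window[y2][x2]] += 1
                total += 1
    if total == 0:
        raise ValueError("displacement leaves no pixel pairs inside the window")
    return [[c / total for c in row] for row in counts]


# Assumed nearest-neighbor displacements (image rows grow downwards).
ORIENTATIONS = {0: (1, 0), 45: (1, -1), 90: (0, -1), 135: (-1, -1)}

window = [[0, 0, 1], [0, 1, 1], [1, 1, 1]]  # tiny 2-level example window
for angle, (dx, dy) in ORIENTATIONS.items():
    print(angle, cooccurrence(window, dx, dy, levels=2))
```

In practice the same routine would be applied to each 16 × 16 window after quantizing its gray levels, once per orientation.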
A 16–30–2 ANN (540 weights, 32 biases) with logistic
activations was trained on–line to discriminate between
normal and suspicious image regions using 300 randomly selected patterns from the first frame. On–line
training stopped when the ANN exhibited 3% misclassifications on the training set. It must be noted that the
first stage was extremely fast; approximately 40 training epochs were needed. For on–line retraining, the DE
population has been initialized with weight vectors in
the neighborhood of the weight vector obtained from
the first stage. In order to test the tracking performance of the hybrid algorithm, we introduced in the
training set patterns from other frames of the same
video sequence, which exhibited resolution changes, a different perceptual direction of the physician, and different
diffused light conditions. Thus, on–line retraining was
performed using a training set of 1200 patterns. The
DE algorithm was allowed to perform only two iterations
with each pattern. This was necessary to prevent
“catastrophic interference” among patterns of different
frames. To test the performance of the trained ANN,
approximately 4000 patterns were extracted from
each frame. The 4000 patterns cover the whole image
region of a frame and contain normal and suspicious
samples.
Figure 2: Frame of the colonoscopic video sequence.
The generalization results with and without the evolution stage of the
algorithm are exhibited in Table 1. The first column of Table 1 exhibits
results from training a special ANN for each frame and then testing it
using data from the same frame, without on–line retraining. For example,
let us observe Frame 1. The corresponding ANN was trained using data
extracted from Frame 1 and achieved a recognition success of 83.77% when
tested with the whole frame. Similarly, a percentage of success of 87.60%
was achieved by the ANN that was trained on Frame 4. On the other hand,
the hybrid method, by applying on–line retraining, managed to locate
weights that are eminently suitable for all of the four frames. Thus, the
ANN trained with the hybrid method provides a higher percentage of
generalization in all cases when compared with the four specially trained
ANNs.

           Without Evolution   With Evolution
Frame 1    83.77%              91.91%
Frame 2    77.18%              83.57%
Frame 3    82.84%              93.09%
Frame 4    87.60%              89.24%
Table 1: Results from the interpretation of images.
5 Conclusions
A new hybrid method for on–line neural network training has been
developed, tested and applied to a texture classification problem and a
tumor detection problem in colonoscopic video sequences. Simulation
results suggest that the new method exhibits fast and stable learning and
good generalization, and is therefore likely to perform well in practice.
The proposed algorithm is able to train large networks on–line, and seems
better suited for tasks with large, redundant or slowly time–varying
training sets, such as those of image analysis. Further work is needed to
optimize the performance of the hybrid algorithm, as well as to test it on
even bigger training sets.
Acknowledgements
The authors gratefully acknowledge the contribution
of Dr. S. Karkanis and Mr. D. Iakovidis of the Image
Processing Lab at the Department of Informatics and
Telecommunications, University of Athens, Greece, in
the acquisition of the data.
References
[1] P. Angeline, “Tracking extrema in dynamic environments”. In: Proc. of the Sixth Annual conference
on Evolutionary Programming VI, 335-345, Springer,
(1997).
[2] L.B. Almeida, T. Langlois, J.D. Amaral, and
A. Plankhov, “Parameter adaptation in Stochastic Optimization”. In: On–line Learning in Neural Networks,
D. Saad, (ed.), 111–134, Cambridge University Press,
(1998).
[3] T. Bäck and H.P. Schwefel, “An overview of evolutionary algorithms for parameter optimization”, Evolutionary Computation, 1, 1–23, (1993).
[4] S.W. Baik and P. Pachowicz, “Adaptive object recognition based on the radial basis function
paradigm”. In: Proc. of the IEEE Int. Joint Conf. on
Neural Networks (IJCNN’99), Washington, U.S.A.,
(1999). CD-ROM Proceedings, Paper No.215, Session
9.4.
[5] S. Baluja, “Evolution of an artificial neural network based autonomous land vehicle controller”, IEEE
Transactions on Systems, Man, and Cybernetics–Part B,
26, 450–463, (1996).
[6] P. Brodatz, Textures – a Photographic Album for
Artists and Designers. New York, Dover, (1966).
[7] A.D. Doulamis, N.D. Doulamis, S.D. Kollias,
“On–line retrainable neural networks: improving the
performance of neural networks in image analysis problems”, IEEE Transactions on Neural Networks, 11,
137–155, (2000).
[8] R.M. Haralick, “Statistical and structural approaches to texture”, Proc. IEEE, 67, 786–804, (1979).
[9] S. Karkanis, G.D. Magoulas, and N. Theofanous,
“Image recognition and neuronal networks: Intelligent
systems for the improvement of imaging information”,
Minimal Invasive Therapy & Allied Technologies, 9,
225–230, (2000).
[10] G.D. Magoulas, M.N. Vrahatis, and G.S. Androulakis, “Effective back-propagation training with
variable stepsize”, Neural Networks, 10, 69-82, (1997).
[11] G.D. Magoulas, V.P. Plagianakos, and M.N. Vrahatis, “Adaptive stepsize algorithms for on–line training of neural networks”, Nonlinear Analysis: Theory,
Methods and Applications, in press, (2001).
[12] Z. Michalewicz and D.B. Fogel, How to solve it:
Modern heuristics, Springer, 2000.
[13] P.W. Pachowicz and S.W. Baik, “Adaptive RBF
classifier for object recognition in image sequences”.
In: Proc. of the IEEE Int. Joint Conf. on Neural Networks (IJCNN’2000), Como, Italy, vol. VI-600, (2000).
[14] N.G. Panagiotidis, D. Kalogeras, S.D. Kollias,
and A. Stafylopatis, “Neural network-assisted effective
lossy compression of medical images”, Proc. IEEE, 84,
1474–1487, (1996).
[15] V.P. Plagianakos and M.N. Vrahatis, “Training Neural Networks with Threshold Activation Functions and Constrained Integer Weights”. In: Proc.
of the IEEE Int. Joint Conf. on Neural Networks
(IJCNN’2000), Italy, (2000).
[16] V.P. Plagianakos, G.D. Magoulas, and M.N. Vrahatis, “Learning in multilayer perceptrons using global
optimization strategies”, Nonlinear Analysis: Theory,
Methods and Applications, in press, (2001).
[17] V.P. Plagianakos, G.D. Magoulas, and M.N. Vrahatis, “Improved learning of neural nets through global
search”. In: Global Optimization - Selected Case Studies, J.D. Pintér (ed.), Kluwer Academic Publishers, to
appear, (2001).
[18] J.C.F. Pujol and R. Poli, “Evolving the topology
and the weights of neural networks using a dual representation”, Applied Intelligence, 8, 73–84, (1998).
[19] D. Saad, On–line learning in neural networks,
Cambridge University Press, (1998).
[20] R. Salomon and P. Eggenberger, “Adaptation
on the evolutionary time scale: a working hypothesis
and basic experiments”. In: Proc. of the Third European
Conference on Artificial Evolution (AE’97), Nimes,
France, Lecture Notes in Computer Science vol. 1363,
Springer, (1998).
[21] N.N. Schraudolph, “Online local gain adaptation for multi–layer perceptrons”, Technical Report
IDSIA–09–98, IDSIA, Lugano, Switzerland, (1998).
[22] N.N. Schraudolph, “Local gain adaptation
in stochastic gradient descent”, Technical Report
IDSIA–09–99, IDSIA, Lugano, Switzerland, (1999).
[23] R. Storn and K. Price, “Differential Evolution –
A simple and efficient heuristic for global optimization
over continuous spaces”, Journal of Global Optimization, 11, 341–359, (1997).
[24] R.S. Sutton, “Adapting bias by gradient descent:
an incremental version of delta–bar–delta”. In: Proc.
of the Tenth National Conference on Artificial Intelligence, MIT Press, 171–176, (1992).
[25] R.S. Sutton and S.D. Whitehead, “Online learning with random representations”. In: Proc. of the
Tenth International Conference on Machine Learning,
Morgan Kaufmann, 314–321, (1993).
[26] F. Vavak and T.C. Fogarty, “A comparative
study of steady state and generational genetic algorithms”. In: Proceedings of Evolutionary Computing:
AISB Workshop, Lecture Notes in Computer Science
vol. 1143, Springer, (1996).
[27] G.G. Wilkinson, “Open questions in neurocomputing for earth observation”. In: Proc. of the first
COMPARES Workshop, York, U.K., (1996).