Article in Soft Computing · August 2014
DOI: 10.1007/s00500-013-1166-8
Hybrid Back-Propagation Training with Evolutionary Strategies
José Parra · Leonardo Trujillo · Patricia Melin
Abstract This work presents a hybrid algorithm for neural network training that combines the back-propagation (BP) method with an evolutionary algorithm. In the proposed approach, BP updates the network connection weights, and a (1+1) Evolutionary Strategy (ES) adaptively modifies the main learning parameters. The algorithm can incorporate different BP variants, such as gradient descent with adaptive learning rate (GDA), in which case the learning rate is dynamically adjusted both by the stochastic (1+1)-ES and by the deterministic adaptive rules of GDA, a combined optimization strategy known as memetic search. The proposal is tested on three different domains: time series prediction, classification and biometric recognition, using several problem instances. Experimental results show that the hybrid algorithm can substantially improve upon the standard BP methods. In conclusion, the proposed approach provides a simple extension to basic BP training that improves performance and lessens the need for parameter tuning in real-world problems.
Keywords Neural Networks, Back-propagation,
Evolutionary Strategies
1 Introduction
Artificial neural networks (ANNs) are one of the most
widely used paradigms in pattern analysis and machine learning research. In particular, the multi-layer
José Parra
Instituto Tecnológico de Tijuana, Av. Tecnológico, Fracc. Tomás Aquino, Tijuana, B.C., México
Tel.: +52-664-6827229
E-mail: galavizjpg@gmail.com, leonardo.trujillo@tectijuana.edu.mx, pmelin@tectijuana.mx
perceptron (MLP) has proven to be a powerful and
versatile tool in many domains, including recognition
(Melin and Castillo, 2005), classification (Zhang, 2000)
and time series prediction (Castillo and Melin, 2002).
MLPs are feed-forward and fully connected ANNs
that are trained with supervised learning methods, the
most common of which is the back-propagation (BP)
algorithm (Rumelhart et al., 1986). BP is a gradient descent method that propagates an error measure (such
as the mean square error) from the output layer of the
network to the input layer, taking into account all hidden layers in between. For years this algorithm has
been considered the standard approach for supervised
MLP training. However, BP is also hampered by three
noteworthy shortcomings. First, BP suffers from learning problems, such as overtraining and sometimes a
slow convergence depending on the characteristics of
the problem and the initial connection weights. Second, it can lead to network paralysis where the algorithm is unable to significantly modify the connection
weights to achieve performance improvements. And
third, it often gets trapped in local minima, a common problem for gradient-based methods. To overcome these shortcomings, several improvements to
basic BP training have been proposed (Isasi Viñuela, 2004); for instance, gradient descent with adaptive learning rate (GDA) and gradient descent with momentum (GDM).
Currently, the improved variants of BP are widely
used when training MLPs. Nevertheless, one drawback of these methods is that they tend to require
several ad-hoc decisions to correctly apply them in
real-world problems. Furthermore, they introduce several new parameters to achieve adaptation during
training, but the parameters themselves remain constant throughout the entire process. This constrains
the manner in which the learning process searches for
the minimum over the error surface. One common approach to overcome these limitations is using more robust variants of BP, such as the Rprop algorithm (Riedmiller, 1994). Still others have turned to population
based optimization methods, such as evolutionary algorithms, to search for the optimal set of connection
weights, thereby avoiding gradient-based learning altogether (Yao, 1999; Cantú-Paz and Kamath, 2005;
Kiranyaz et al., 2009). However, such an approach
does not exploit the powerful local optimization that
gradient-based methods can offer. Therefore, we propose a hybrid approach that combines the learning improvements of previously proposed methods with the
global search of evolutionary algorithms.
The proposal developed here dynamically modifies the main parameters of a BP algorithm in an unconstrained manner using a (1+1) Evolutionary Strategy (ES). The canonical training methods are easily incorporated within the evolutionary loop of the (1+1)-ES, yielding a simple hybrid learning strategy. Indeed, Cantú-Paz and Kamath (2005) performed a comprehensive comparison of different evolutionary approaches for neural network training and found that
simpler methods performed the best; therefore, simplicity was a guiding principle in the proposal presented here. The experimental work evaluates the performance of this proposal using benchmark problems
of time series prediction, classification and biometric
recognition. Results are promising, achieving a substantial improvement in most cases and never performing substantially worse.
The remainder of this paper is organized as follows. Section 2 contains a brief introduction to the BP
algorithm and some of the most widely used variants.
Then, Section 3 presents the problem addressed in this
work and introduces the hybrid training proposal. Section 4 outlines previous work in hybrid evolutionary–neural network research, while Section 5 gives a detailed description of the proposed algorithm and its
memetic variants. Afterwards, Section 6 contains the
experimental setup and results on several benchmark
problems. Finally, concluding remarks are given in
Section 7.
2 Background and basic concepts
2.1 Back-propagation training
The standard supervised learning method for MLPs
is BP, a gradient descent optimization algorithm that
backwardly propagates the error between the network
output and the desired output. BP uses this error to
modify the network connection weights based on the
gradient and on a learning rate parameter β, which
modulates the step size of the weight updates. Despite the success of BP training, it is well-known that
it suffers from several limitations. Therefore, many researchers have proposed algorithmic improvements.
2.1.1 Back-propagation algorithm
In an MLP, the input pattern is propagated forward
through the network, and this results in a vector of activation values in the output layer, which means that
the MLP basically acts as a function. The behavior of
the function can be changed by modifying the connection weights between individual neurons. Starting
from an initial set of random connection weights, the
objective of BP is to adaptively modify these weights
in order to achieve a particular input/output behavior. In fact, the overall goal is to minimize the error E
given by
E = ½ (dp − yp)² ,  (1)
where dp is the desired output and yp is the actual network output obtained for each input pattern xp¹. In order to do so, the connection weight wijl between neuron i in layer l and neuron j in layer l + 1 is modified
at each epoch t by the following rule
wijl (t + 1) = wijl (t) + ∆wijl (t) ,  (2)
where
∆wijl (t) = β δi(l+1)p (t) yjlp (t) ,  (3)

in which δi(l+1)p (t) is the generalized error term at neuron i in
layer l + 1 for pattern p, given by the product of the
first derivative of the activation function and error E,
yjlp (t) is the output of neuron j in layer l for pattern
p, and where β is the learning rate parameter. For a
more complete description of BP training the interested reader is referred to (Rumelhart et al., 1986; Radi
and Poli, 2003).
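The update rules of Eqs. (1)–(3) can be sketched for a single sigmoid neuron and one training pattern. This is an illustrative reconstruction, not the authors' implementation; the logistic activation, the weight values and the target are assumptions:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bp_step(w, x, d, beta):
    """One BP update for a single sigmoid neuron and one pattern:
    E = 1/2 (d - y)^2; the generalized error term is (d - y) * y * (1 - y)
    (the error times the derivative of the activation, as in Eq. 3), and
    w <- w + beta * delta * x follows Eq. 2."""
    y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    delta = (d - y) * y * (1.0 - y)          # generalized error term
    new_w = [wi + beta * delta * xi for wi, xi in zip(w, x)]
    return new_w, 0.5 * (d - y) ** 2         # updated weights and error E

w = [0.1, -0.2]                              # assumed initial weights
errs = []
for _ in range(200):
    w, e = bp_step(w, [1.0, 0.5], 0.9, beta=0.5)
    errs.append(e)
print(errs[0] > errs[-1])                    # the error decreases over epochs
```

The gradient step is taken per pattern here; batch variants simply accumulate the updates over all patterns before applying them.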
2.2 BP with adaptive learning rate
One of the earliest improvements to BP was to adaptively modify the learning rate on-line during training (Hagan et al., 1996). In standard BP the learning
rate is constant during the entire training process; it is
thus imperative to choose a correct initial value. For instance, if the learning rate is set too high the algorithm
¹ The error measure E given above is just one possible measure that can be used.
may oscillate and become unstable. Conversely, if it is
too small then convergence will be slow. However, setting an optimal value for β is not trivial. Moreover, the
optimal value might change during the training process. Therefore, if β is allowed to change during training the quality of the learning process might improve.
The idea is to make β responsive to the structure of the
local error surface.
In order to implement this idea, the GDA algorithm
modifies BP in the following ways. First, the initial network output and error are calculated. At each epoch
new weights are computed using the current β. New
outputs and errors are then measured. If the new error exceeds the old error by more than a predefined
threshold then the new weights are discarded and β
is decreased by a fixed amount, call this parameter
β∆− . Otherwise, the new weights are kept, and if the
new error is less than the old error, the learning rate
β is increased by a constant parameter β∆+ . Therefore, if a larger β could result in stable learning then
it is increased. On the other hand, if the learning rate
is too high to guarantee a decrease in error then it is
decreased until stable learning resumes (Hagan et al.,
1996).
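The GDA rule just described can be sketched as follows. The multiplicative increase/decrease factors mirror the Toolbox defaults quoted in Section 6.1 (a = 1.05, b = 0.7); the rejection threshold `max_ratio` is an assumption of this sketch:

```python
def gda_update(beta, old_err, new_err, inc=1.05, dec=0.7, max_ratio=1.04):
    """GDA-style learning-rate adaptation (sketch):
    - if the new error exceeds the old one by more than the threshold,
      the new weights are discarded and beta is decreased;
    - if the error decreased, the weights are kept and beta is increased;
    - otherwise the weights are kept and beta is unchanged.
    Returns the adapted rate and whether the new weights are accepted."""
    if new_err > old_err * max_ratio:
        return beta * dec, False     # reject weights, decrease rate
    if new_err < old_err:
        return beta * inc, True      # accept weights, increase rate
    return beta, True                # accept weights, keep rate

beta, accepted = gda_update(0.5, old_err=1.0, new_err=2.0)
print(beta, accepted)                # rate decreased, step rejected
```

The same function is called once per epoch, so β tracks the local shape of the error surface while the weight updates themselves remain ordinary gradient steps.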
2.3 BP with momentum
Another proposed improvement is the gradient descent with momentum (GDM) algorithm, which takes
a different approach towards overcoming some of the
shortcomings of BP. It is equivalent to BP, with an
added parameter called the momentum coefficient γ
which is used to modify the weight update rule as follows (Samarasinghe, 2006)
∆wijl (t) = γ∆wijl (t−1)+(1−γ)βδi(l+1)p (t)yjlp (t) . (4)
In essence, γ produces an averaging effect by which
changes in connection weights consider both the current error value as well as past weight changes.
2.4 BP with momentum and adaptive learning rate
Based on the previous methods, GDA and GDM,
the gradient descent algorithm with momentum and
adaptive learning rates was proposed (GDX), which
combines the advantages of both (Hagan et al., 1996).
3 Problem description and main proposal
We have outlined three of the most common improvements to BP training, GDA, GDM and GDX. However,
the manner in which these methods operate also raises
other issues. For instance, in GDA the learning rate is
either increased or decreased by the fixed parameters
β∆+ and β∆− . One could argue that the value of these
parameters should also be subject to an adaptive process during training. Furthermore, in GDM the γ coefficient is held constant, and there is no a priori reason
to assume that such a strategy is optimal. Therefore, in
this work we hypothesize that a better learning strategy would be able to adaptively modify all of the main
parameters of the algorithm in an unconstrained manner. This follows from the basic argument behind the
GDA method, where it is assumed that because the error surface changes during training then the optimal β
should also change. Therefore, we argue that the same
logic must hold for parameters such as β∆+ and β∆− ,
and the γ coefficient. For example, in GDA β should
be able to increase or decrease during training without
the need of constant step sizes, and in GDM γ could
also be adaptively modified.
Therefore, we propose an improvement to BP that
can dynamically change the main parameters of the
learning algorithm without constant step values. In
order to achieve this we develop our proposal using a hybrid algorithm that combines an evolutionary search process and standard BP. Concretely, we use
evolutionary strategies as a global search method that
adapts the main learning parameters during training
and allows the gradient descent algorithm to perform
a local search over the MLP error surface. With the
hybrid approach, we combine the exploration capabilities of evolutionary search with the local optimization provided by gradient descent. Moreover, the proposal can incorporate the basic BP algorithm as well as
any of the previously proposed improvements (GDA,
GDM and GDX) without requiring substantial modifications. To contextualize the current contribution, the
following contains a brief overview of how evolutionary computation has intersected with ANN research in
previous works.
4 Evolutionary computation and ANNs
Evolutionary computation (EC) encompasses a large
variety of global search and optimization methods
that are based on an abstract model of Neo-Darwinian
evolutionary theory. Some of the most widely known
paradigms are genetic algorithms, evolutionary strategies and genetic programming, all of which are
based on similar conceptual principles (DeJong, 2002),
and are closely related to other population-based
meta-heuristics such as particle swarm optimization
(Kiranyaz et al., 2009) and ant colony optimization (Dorigo and Stützle, 2004). These methods have
proven to be quite robust and flexible, applicable to a
large variety of application domains and problem instances.
5 The proposed hybrid learning approach
This section presents our hybrid approach for BP
learning using evolutionary computation.
5.1 Evolutionary strategies
In the case of ANNs, many attempts have been
made to use evolutionary methods to optimize a specific characteristic of an ANN (Yao, 1999; Cantú-Paz
and Kamath, 2005). Probably the most common strategy is to use EC to determine the optimal connection
weights of an ANN (Yao, 1999; Fogel et al., 1990), in
some cases combining an EC algorithm with standard
learning techniques (Alba and Chicano, 2004). For instance, this approach has found strong acceptance in
robotics applications, an approach referred to as evolutionary robotics, where a well posed error gradient is not feasible (Nolfi and Floreano, 2000). Recently,
these approaches have allowed researchers to train extremely large networks by exploiting concepts from
developmental biology and indirect encoding schemes
(Stanley et al., 2009). Another possibility is to use EC to
search for an optimal network topology (Miller et al.,
1989) and then use standard learning methods to determine the connection weights of the network. Still
yet, others have attempted to reduce the amount of
a priori knowledge as much as possible by concurrently searching for the optimal network topology and
connection weights within a single evolutionary loop
(Harp et al., 1989; Stanley and Miikkulainen, 2002).
On the other hand, EC has also been used to develop improvements to the traditional learning process. For instance, some researchers use EC techniques
to determine the initial connection weights of an ANN,
instead of using random weights, from which the
learning algorithm can then proceed (Lee, 1996). Still
yet, others have focused on optimizing the BP parameters off-line, to find optimal values that can be used
throughout the entire learning process (Patel, 1996) or
by considering the learning parameters and connection weights concurrently (Merelo et al., 1993). Finally,
some researchers have used automatic program induction with genetic programming to derive new learning
rules through an evolutionary search (Radi and Poli,
2003). In our work, however, we are interested in developing an adaptive strategy similar to what is done
in GDA, with an additional evolutionary search that
allows for dynamic modifications of the BP parameters. A similar approach was proposed in (Kim et al.,
1996) with several important differences that are addressed in Section 5.5.
Evolutionary strategies (ES) are an optimization technique based on the core principles of EC highlighted in the previous section (Schwefel, 1981). In ES, candidate solutions are coded as real-valued parameter
vectors. In the canonical version only one operator is
used to generate new parameter vectors (offspring),
a Gaussian mutation that perturbs the value of each
parameter; for a detailed introduction see (Eiben and
Smith, 2003). Moreover, two selection strategies are
commonly used with ESs, (µ + λ) and (µ, λ), where µ
is the number of individuals in a population and λ is
the number of offspring generated at each generation.
In a (µ + λ)-ES, the individuals contained in the next
generational loop are chosen from the best solutions
from both the past population and the offspring. Conversely, in (µ, λ) the λ offspring replace all of the µ individuals in the previous population. In other words,
(µ + λ) is an elitist strategy while (µ, λ) is not (Eiben
and Smith, 2003).
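A minimal (1+1)-ES with elitist survival can be sketched as follows. This is a toy illustration on the sphere function, not the hybrid algorithm of Section 5.2; the budget and seed are arbitrary choices:

```python
import random

def es_1p1(f, x0, sigma=0.2, generations=200, seed=1):
    """(1+1)-ES with Gaussian mutation and elitist (mu + lambda) survival:
    one parent, one offspring, and the better of the two survives."""
    rng = random.Random(seed)
    x, fx = list(x0), f(x0)
    for _ in range(generations):
        y = [xi + rng.gauss(0.0, sigma) for xi in x]   # mutate every gene
        fy = f(y)
        if fy <= fx:                                   # elitist selection
            x, fx = y, fy
    return x, fx

# Minimize the sphere function as a toy example.
best, fbest = es_1p1(lambda v: sum(vi * vi for vi in v), [2.0, -2.0])
print(round(fbest, 4))
```

Because the parent is only ever replaced by an equal-or-better offspring, the objective value is non-increasing across generations, which is the property the hybrid algorithm below relies on.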
5.2 Evolutionary Strategies for BP learning
As stated above, our proposal is to combine BP with
an evolutionary search. The goal is to provide a mechanism by which the main parameters of a BP algorithm
can be adaptively modified on-line during network
training. For this task we have chosen the (1+1)-ES, because:
– It is well-known and understood.
– It is particularly well suited for real-valued parameter optimization.
– It is very simple to implement, which allows us
to maintain the basic structure of BP unchanged.
From this it follows that the method will not dramatically increase the computational cost of training an ANN. Simplicity was explicitly considered
given the conclusions derived by the extensive
comparison of previously proposed evolutionary
approaches carried out by Cantú-Paz and Kamath
(2005).
The proposal is a hybrid learning process, such that
a (1+1)-ES adaptively changes the BP parameters after a specified number of epochs, during which the
BP training algorithm carries out the standard weight
updates. The (1+1)-ES represents the simplest type of
evolutionary search, that lacks the intrinsic parallel
nature of other population-based evolutionary techniques, such as genetic algorithms or genetic programming. Therefore, the (1+1)-ES is closely related to other
heuristic search methods such as simulated annealing
(Kirkpatrick et al., 1983). Nonetheless, the aim of the
proposal is to develop a simple and robust improvement to BP training with a minimal amount of computational overhead, and, as we shall see in our results,
the (1+1)-ES fulfills these requirements.
The proposed algorithm proceeds as follows. First,
generate a BP parameter vector x with standard initial
values that are commonly used in the literature. In this
case, the number of parameters depends on the version of BP used. For instance, if we use GD or GDA
then x would contain only the β parameter. On the
other hand, if we use GDM or GDX, then x would contain β and the momentum coefficient γ. Afterwards,
randomly generate the initial connection weights of
the ANN called Ax, just as it would be done in standard BP. This leads to the first generation (iteration)
of the (1+1)-ES. Within the evolutionary loop, create a
mutated version of x called y, using a Gaussian mutation with the same σ for all elements. Then, make
a copy of Ax, call this Ay. Train Ax using the BP parameters specified in x for a total of ρ epochs, and the
same is done for Ay with the parameter values y. After training both networks we obtain a corresponding
convergence error from each, call these Ex and Ey,
respectively. The error values are used to determine
which ANN and which parameter vector will survive
for the following generation. This process is repeated
until one of two conditions is met: (1) the total number
of generations is reached; or (2) the goal error for the
ANN is achieved. In this algorithm there are two new
parameters. One is the number of epochs per generation, denoted by ρ. The other is the value of the step
size σ of the Gaussian mutation. In this work, σ is set
to a constant value of 0.2, while ρ is set using an extensive experimental evaluation described in the following section. It should be mentioned that some of
the most advanced, and successful, variants of ES are
those that adaptively modify σ by including it as another decision variable optimized by the evolving process (Eiben and Smith, 2003). In fact, in such a case an
ES can fit σ to the structure of the fitness landscape,
with the slight tradeoff of increasing the dimensionality of the search space and possibly making it more
complex. However, in our case the search space is not
fixed, that is, fitness evaluation is dynamic because
BP modifies the connection weights at each time step
(possibly several times) and thus changes the under-
lying structure of the fitness landscape. Therefore, we
choose not to include an adaptive mutation step size
in the evolutionary process.
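The loop described above can be sketched as follows. This is a schematic reconstruction: `bp_train` stands in for any BP variant (GD, GDA, GDM or GDX), and the toy one-weight "network" at the bottom exists only so the sketch runs end to end:

```python
import copy
import random

def hybrid_es_bp(bp_train, init_weights, p0, rho=50, generations=30,
                 sigma=0.2, seed=7):
    """Hybrid (1+1)-ES + BP loop: bp_train(weights, params, rho) trains
    `weights` in place for rho epochs under BP parameters `params` and
    returns the convergence error."""
    rng = random.Random(seed)
    x = list(p0)                                      # e.g. [beta] or [beta, gamma]
    Ax = copy.deepcopy(init_weights)                  # initial connection weights
    Ex = float("inf")
    for _ in range(generations):
        y = [xi + rng.gauss(0.0, sigma) for xi in x]  # Gaussian mutation of x
        Ay = copy.deepcopy(Ax)                        # offspring copies the weights
        Ex = bp_train(Ax, x, rho)                     # train parent with x
        Ey = bp_train(Ay, y, rho)                     # train offspring with y
        if Ey <= Ex:                                  # (1+1) survival selection
            x, Ax, Ex = y, Ay, Ey
    return x, Ex

# Toy stand-in for a network: one weight, error (w - 3)^2, plain GD as "BP".
def toy_bp(weights, params, rho):
    beta = max(params[0], 1e-4)                       # keep the rate positive
    for _ in range(rho):
        weights[0] -= beta * 2.0 * (weights[0] - 3.0)
    return (weights[0] - 3.0) ** 2

params, err = hybrid_es_bp(toy_bp, [0.0], [0.05])
print(err < 1e-6)
```

Note that a mutated parameter vector that destabilizes training simply produces a large convergence error and is rejected by the elitist selection, so bad mutations cost at most ρ wasted epochs.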
5.3 Memetic search
Since the proposal in this work does not depend on
a specific BP variant, it is also possible to use learning algorithms that can also modify the learning parameters, in particular GDA and GDX which adapt
the learning rate β. If such is the case, this parameter
can be adapted by two separate mechanisms, the more
unconstrained global search carried out by ES and by
local optimization with gradient descent. This type of
combined optimization strategy is called a memetic algorithm within EC literature (Eiben and Smith, 2003),
and two variants exist. The first one is a Lamarckian
memetic algorithm (Husbands, 1994), where inheritance of acquired traits is possible, while the other is
based on Baldwin’s theory of inherited learning ability (Paul G. Hoel, 1987). The difference between both
approaches is based on whether or not the modifications made to β by the underlying BP algorithm (in
this case GDA or GDX) are encoded back to the parameter vectors (x or y) and thus propagated to the
next generation. In the Lamarckian case the updated β
is inserted into the parameter vector, while in the Baldwinian case it is not. In our work, we test both variants,
thus we have a total of ten different algorithms:
– Standard algorithms: Basic back-propagation (GD),
GDA, GDM and GDX.
– Non-memetic BP with ES: GD-ES and GDM-ES.
– Lamarckian memetic algorithms: GDA-ES/L and
GDX-ES/L.
– Baldwinian memetic algorithms: GDA-ES/B and
GDX-ES/B.
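The only difference between the two memetic variants is the write-back step at the end of each generation, which can be sketched as follows (a schematic fragment; the parameter vector layout [β, γ] is an assumption):

```python
def memetic_step(x, bp_adapted_beta, lamarckian):
    """Write-back rule distinguishing the two memetic variants: under
    Lamarckian inheritance, the beta adapted by GDA/GDX during the rho
    training epochs is copied back into the ES parameter vector; under
    the Baldwinian variant, the vector keeps its pre-training value."""
    x = list(x)
    if lamarckian:
        x[0] = bp_adapted_beta      # acquired trait is inherited
    return x

print(memetic_step([0.5, 0.9], 0.62, lamarckian=True))   # -> [0.62, 0.9]
print(memetic_step([0.5, 0.9], 0.62, lamarckian=False))  # -> [0.5, 0.9]
```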
5.4 Discussion and design choices
A reasonable question arises regarding the proposed
algorithm. In the development of an algorithm that can
automatically adjust the learning rate and momentum
coefficient for a BP training algorithm, we have introduced several new parameters related to the (1+1)-ES,
namely µ, λ, σ and ρ. Then, it might seem that the original concern arises again: is it necessary to dynamically
adjust these parameters based on the learning process
to achieve optimal performance? However, care must
be taken regarding such concerns, if not we might fall
into an infinite regress of algorithm design if an algorithm without parameters cannot be derived. Consider that the goal of an evolutionary algorithm is to
automate a real-world problem solving process, allowing system designers to transfer the responsibility
of determining some solution features to an independent process. This, as stated before, comes at a cost,
with a possible increase in system parameters and (for
most problems) no guarantee of optimality. However,
in practice these shortcomings are mostly overcome by
the fact that evolutionary algorithms tend to be quite
robust, achieving strong results in a wide range of parameter values (Eiben and Smith, 2003; DeJong, 2002).
The implicit assumption made when evolutionary algorithms are used is that the search process will be easier to set up and configure than the underlying problem that needs to be solved.
Returning to the problem addressed in this work,
the following design choices are made. First, population sizes µ and λ are set to 1, to maintain computational cost low with respect to standard BP variants.
Second, the Gaussian mutation step size σ is set to a
constant value to maintain a simple search process and
because the search presents a dynamic fitness landscape. Third, the step size is set to σ = 0.2 since the domain of both BP parameters that are evolved, β and γ,
is basically [0, 1]², thus the step size provides a good
exploratory search. Finally, the number of epochs per
generation ρ can be seen as the only new parameter
in the proposed (1+1)-ES. Therefore, the proposed experimental study will first focus on analyzing the influence of ρ on the quality of network training over
a wide range of values. It should be noted that such
an experimental validation is definitely not exhaustive
(there might be a plausible interdependence between
the (1+1)-ES parameters), but any further analysis is
left as future research. Nonetheless, the experimental
work is aimed at showing that the (1+1)-ES provides a
sufficiently robust algorithm that simplifies parameter
setting and tuning.
5.5 Comparison with previous work and design
choices
The proposal developed in this work is one of a handful of methods that address the problem of MLP training with artificial evolution. In fact, to our knowledge,
only the work in (Kim et al., 1996) presents a similar
approach, with several noteworthy differences. First,
² For the momentum coefficient γ this is strictly the case, while for the learning rate β most published works suggest values within this range.
Table 1 Mackey Glass time series problem, from left to right: dataset, network architecture and initial parameters for BP algorithm.

Time Series: Mackey Glass
  Dataset: 800 samples: 500/training, 300/testing
  Network: 1 hidden layer, 25 neurons; 3 inputs: x(t − 1), x(t − 2), x(t − 3); Tan-Sigmoid activation; prediction horizon: x(t)
  Params:  β = 0.01, γ = 0.9, Epochs = 4000, Goal = 1e-10
(Kim et al., 1996) only considers the learning rate, it
does not include the momentum coefficient. Second,
in (Kim et al., 1996) the learning rate is updated every generation by the evolutionary algorithm, thereby
not exploiting a local optimization. In fact, this configuration is contained as a particular case of our approach and is evaluated in the experimental work
presented below. However, the experimental results
strongly suggest that the best performance is achieved
by a trade-off that exploits both optimization strategies
using a memetic approach, instead of the pure evolutionary search of (Kim et al., 1996). Third, (Kim et al.,
1996) uses an evolutionary algorithm with a population of 20 or 50 individuals, thus raising the number of
different networks that must be trained concurrently.
On the other hand, by using a (1+1)-ES we are at most
doubling the total computation required to train a single network, a negligible increase for many real-world
scenarios. Finally, while (Kim et al., 1996) only presents
preliminary results on simple synthetic problems, we
consider several benchmark and real world problems
from ANN and machine learning literature.
6 Experiments and results
The aim of the experimental tests is two-fold. Firstly,
we want to estimate a range of values for which the ρ
parameter gives the best performance. Secondly, we perform a series of comparative tests between the basic
BP algorithms (GD, GDA, GDM and GDX) and the
corresponding ES variants. The goal of the first set of
experiments is to examine how ρ influences the performance of the learning process and to determine a
proper range of values for which ρ gives the best performance. The second set of experiments will provide
a comparative statistical validation of the type of improvements that the hybrid algorithm yields. These experiments are carried out using three different problem types, and six datasets in total. Finally, all of the
algorithms are coded and tested using Matlab 2009a.
Table 2 Description of the classification problems, from left to right: dataset, network architecture and initial parameters for BP algorithm.

Forensic Glass
  Dataset: 214 samples, K-fold = 10, 7 classes, 10 attributes
  Network: 2 hidden layers, 64 neurons, 7 output neurons, Log-Sigmoid activation
  Params:  β = 0.5, γ = 0.5, Epochs = 4000, Goal = 1e-10

Red Wine Quality
  Dataset: 1599 samples, K-fold = 10, 6 classes, 11 attributes
  Network: 2 hidden layers, 15,11 neurons, 6 output neurons, Log-Sigmoid activation
  Params:  β = 0.5, γ = 0.5, Epochs = 4000, Goal = 1e-10

Breast Tissue
  Dataset: 106 samples, K-fold = 3, 6 classes, 9 attributes
  Network: 2 hidden layers, 60 neurons, 4 output neurons, Log-Sigmoid activation
  Params:  β = 0.5, γ = 0.5, Epochs = 4000, Goal = 1e-5
Table 3 Description of the biometric face recognition problem.

Face Recognition
  Dataset: 400 samples, K-fold = 10, 40 persons, 10 images per person
  Network: 2 hidden layers, 84,45 neurons, 40 output neurons
  Params:  β = 0.01, γ = 0.9, Epochs = 1000, Goal = 1e-5
6.1 Datasets and experimental setup
Three different problem types are used: time series
prediction, classification and biometric pattern recognition, with several different instances for each case.
The Mackey Glass time-delay differential equation
(Mackey and Glass, 1977) is used for time series prediction, given by
dx/dt = 0.2 x(t − φ) / (1 + x(t − φ)¹⁰) − 0.1 x(t) .  (5)

6.2 Effects of the number of epochs per generation on network training
We construct three datasets by using φ values of 16, 17 and 30, where the first produces no chaotic behavior and the last produces the most; we respectively call each dataset MKG1, MKG2 and MKG3. The initial conditions in each case are x(0) = 1.2 and x(t) = 0 for t < 0, and the data is generated using a Matlab 2009a implementation of the 4th-order Runge-Kutta method.
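A generation procedure along these lines can be sketched as follows. The paper only states that a Runge-Kutta implementation was used, so the unit step size and the handling of the delayed term (held constant within each step, a common simplification for delay differential equations) are assumptions of this sketch:

```python
def mackey_glass(n, phi=17, dt=1.0, x0=1.2):
    """Generate n samples of the Mackey-Glass series of Eq. (5) with a
    4th-order Runge-Kutta step; x(t) = 0 for t < 0 and x(0) = 1.2, as in
    the text. The delayed term x(t - phi) is read from the stored history
    and frozen within each integration step."""
    def f(x, x_del):
        return 0.2 * x_del / (1.0 + x_del ** 10) - 0.1 * x

    hist = [x0]
    for i in range(n - 1):
        x = hist[-1]
        x_del = hist[i - phi] if i >= phi else 0.0    # delayed term x(t - phi)
        k1 = f(x, x_del)
        k2 = f(x + 0.5 * dt * k1, x_del)
        k3 = f(x + 0.5 * dt * k2, x_del)
        k4 = f(x + dt * k3, x_del)
        hist.append(x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4))
    return hist

series = mackey_glass(800, phi=17)   # the MKG2 configuration
print(len(series))                   # prints 800
```

From such a series, the training patterns of Table 1 are formed by pairing the inputs x(t − 1), x(t − 2), x(t − 3) with the target x(t).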
Additionally, three different classification problems are tested, taken from the UCI repository³; these are forensic glass, red wine quality, and breast tissue.
Finally, the algorithm is also tested on a recognition
problem, using the ORL face dataset (Samaria and
Harter, 2002). For face recognition, each image is given
as input to the neural network in vector form without
any preprocessing; each image is of size 92 × 112 pixels in greyscale.
³ UCI Machine Learning Repository, http://archive.ics.uci.edu/ml
The datasets are summarized in Tables 1, 2 and 3,
along with the neural network architecture and parameters used in each experiment. The learning rate
and momentum coefficient use the default values provided with the implementations of the training algorithms from the Neural Network Toolbox for Matlab,
except for the classification problems where they are
set based on the best performance achieved by GD on
the forensic glass problem. For GDA and GDX β∆+ =
aβ with a = 1.05 and β∆− = bβ with b = 0.7, as suggested in the Neural Network Toolbox. Additionally,
the neural network architecture was also set empirically for all problems based on the best performance
achieved by GD, except for the face recognition problem where GD performs quite poorly, so the architecture is set based on the performance of GDA. In summary, network architecture and initial BP parameters
are set based on the performance of standard BP variants or based on the suggested values from a widely
used Matlab toolbox. This provides an initial baseline
configuration that can be used to fairly compare the
proposed ES-based variants.
Note that the learning rate and momentum parameters define the initial individual of the (1+1)-ES. In all
three cases, a desired goal error is used as the primary
stopping criterion and a maximum number of epochs
as a secondary criterion, both specified in the corresponding tables referenced above.
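Putting these pieces together, the hybrid scheme can be sketched as follows (illustrative Python, not the authors' Matlab implementation; train_epoch and evaluate are placeholders for one BP epoch and the error measure):

```python
import random

def hybrid_bp_es(train_epoch, evaluate, params, rho, generations, sigma=0.1):
    """(1+1)-ES over the BP learning parameters: every generation BP runs
    for rho epochs, and a Gaussian-mutated offspring parameter vector
    replaces the parent only if the resulting error is no worse."""
    for _ in range(rho):                       # initial parent generation
        train_epoch(params)
    best_err = evaluate()
    survived = 0
    for _ in range(generations):
        child = {k: max(1e-6, v + random.gauss(0.0, sigma))
                 for k, v in params.items()}   # mutate lr, momentum, ...
        for _ in range(rho):
            train_epoch(child)
        err = evaluate()
        if err <= best_err:                    # (1+1) selection
            params, best_err, survived = child, err, survived + 1
    return params, survived / generations      # offspring survival rate
```

The returned survival proportion corresponds to the quantity reported in the experiments below: the fraction of generations in which the offspring vector replaced the parent.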
It is important to analyze how the number of epochs per generation, specified by the ρ parameter, affects learning performance for each algorithm. To do so, we test each algorithm on a
specific problem and vary ρ over a fixed range
of values in a discrete logarithmic scale, given by
[2, 3, ..., 10, 20, 30, ..., 100, 200, 300, ..., 1000, 2000]. This
allows us to test the effect of ρ at different orders of
magnitude. We perform this test on a dataset from
each problem type: for time series we use MKG1, for
classification we use forensic glass, and for biometric
recognition we use face recognition. Moreover, in order to obtain statistically valid results, for time series
we compute the average performance over thirty independent runs. On the other hand, for classification
and biometric recognition we use the average performance of 10-fold cross-validation as suggested by Refaeilzadeh et al. (2009). For instance, Figure 1a shows
the average classification accuracy and standard error
on the test set of forensic glass for GDM-ES over the
entire range of ρ values, presented in a log scale. In
this case the performance reaches a maximum plateau in the range of [200, 1000], which can be considered a proper operating range for GDM-ES. This is done for all of the ES variants and all of the problem types; these results are summarized in the second, third and fourth columns of Table 4. For most methods, in all of the problems, the best value for ρ is a few hundred epochs per generation, with some exceptions that achieved better performance with smaller ρ values. Additionally, we can count the number of times that evolution produced improvements on the learning parameters, which is the number of times that the offspring vector substituted the parent vector. This is shown in Figure 1b for the GDM-ES algorithm; the results for all of the methods are given in the final three columns of Table 4, which show the minimum and maximum values obtained within the established range for ρ. Obviously, the number of times that the offspring vector improves upon the parent depends on ρ, which effectively gives the total number of times that an offspring is generated. Therefore, the results are presented as the proportion of times that the offspring outperformed the parent, where a value of 1 would indicate that the offspring always replaces the parent. For GDM-ES, Figure 1 shows that there is a correlation between the maximum performance achieved and the higher proportion of instances in which the offspring outperformed the parent. This indicates that the ES is allowing the network learning process to improve upon what a standard BP run would have achieved. Moreover, similar results were obtained for all ES variants of BP. In the following section, we focus on determining the significance of the improvements, if any, that are achieved by the evolutionary competition between the parent vector and its offspring, by comparing the (1+1)-ES proposals with the standard algorithms.

Table 4 Ranges in which ρ produced the best results for each ES-based BP algorithm.

Method    | ρ, Time Series | ρ, Classification | ρ, Biometric | Avg. offspring, TS | Avg. offspring, Class. | Avg. offspring, Biom.
GD-ES     | 50-100 | 200-1000 | 2-10    | 0.08-0.10 | 0.32-0.67 | 0.13-0.68
GDA-ES/B  | 70-500 | 100-800  | 200-500 | 0.14-0.77 | 0.44-0.86 | 0.30-0.38
GDA-ES/L  | 70-500 | 10-90    | 100-500 | 0.16-0.87 | 0.03-0.24 | 0.20-0.25
GDM-ES    | 2-90   | 200-1000 | 2-9     | 0.05-0.29 | 0.53-0.55 | 0.32-0.80
GDX-ES/B  | 90-700 | 100-1000 | 200-500 | 0.28-0.29 | 0.15-0.92 | 0.42-0.70
GDX-ES/L  | 60-400 | 60-100   | 200-500 | 0.20-0.32 | 0.34-0.37 | 0.25-0.40

Fig. 1 Performance of GDM-ES on the forensic glass classification problem: (a) classification accuracy and (b) offspring survival, shown as mean ± std over the range of epochs per generation (log scale). [Figure not reproduced.]

Table 5 Comparative results for the Mackey-Glass problem showing the average NRMSE and standard deviation.

Method    | ρ   | Error MKG1 | std MKG1 | Error MKG2 | std MKG2 | Error MKG3 | std MKG3
GD        | –   | 2.57 | 1.18 | 0.38 | 0.18 | 1.87 | 1.08
GD-ES     | 80  | 0.30 | 0.15 | 0.20 | 0.11 | 0.30 | 0.16
GDA       | –   | 1.84 | 0.80 | 0.23 | 0.14 | 1.25 | 0.77
GDA-ES/B  | 200 | 0.14 | 0.03 | 0.18 | 0.09 | 0.25 | 0.16
GDA-ES/L  | 70  | 0.15 | 0.07 | 0.20 | 0.11 | 0.20 | 0.06
GDM       | –   | 2.88 | 1.19 | 0.45 | 0.26 | 2.04 | 1.15
GDM-ES    | 90  | 0.23 | 0.09 | 0.15 | 0.06 | 0.27 | 0.11
GDX       | –   | 2.93 | 1.43 | 0.21 | 0.12 | 2.07 | 1.19
GDX-ES/B  | 700 | 0.15 | 0.05 | 0.14 | 0.08 | 0.19 | 0.07
GDX-ES/L  | 60  | 0.16 | 0.07 | 0.13 | 0.07 | 0.21 | 0.09

Fig. 2 Time series prediction on the test data for the MKG2 time series, showing real and predicted x(t) for (a) GDM and (b) GDM-ES. [Figure not reproduced.]

Table 6 Comparative results for the classification problems, showing average classification accuracy on test data and the standard error over the k-fold cross-validation.

Method    | ρ   | Acc. Glass | std Glass | Acc. Wine | std Wine | Acc. Breast | std Breast
GD        | –   | 34.5 | 4.3 | 48.8 | 7.2 | 46.21 | 0.87
GD-ES     | 500 | 77.2 | 8.7 | 48.2 | 7.1 | 54.65 | 7.99
GDA       | –   | 35.0 | 2.1 | 46.4 | 3.9 | 46.21 | 0.87
GDA-ES/B  | 100 | 74.1 | 8.0 | 50.8 | 4.1 | 70.82 | 20.88
GDA-ES/L  | 100 | 63.0 | 8.9 | 53.6 | 2.8 | 60.42 | 5.09
GDM       | –   | 36.8 | 2.4 | 50.0 | 6.7 | 46.21 | 0.87
GDM-ES    | 600 | 77.8 | 9.2 | 51.4 | 5.3 | 52.88 | 11.13
GDX       | –   | 35.5 | 1.8 | 50.0 | 5.5 | 50.97 | 7.84
GDX-ES/B  | 500 | 77.9 | 9.3 | 50.4 | 5.4 | 55.68 | 8.93
GDX-ES/L  | 100 | 66.2 | 5.6 | 50.5 | 4.4 | 53.80 | 10.53

6.3 Comparative tests

This section compares each of the baseline BP algorithms with their ES variants on all of the test problems, setting the ρ parameter based on the preceding tests. For the standard BP algorithms, on the other hand, the parameters are set to common values that have been determined after years of theoretical and empirical work, which is why we do not optimize them for each specific problem. It is important to consider that parameter selection and tuning is an important open problem in machine learning, and most real-world implementations are based on an initial “good guess” made by the system designer, considering what has previously been reported to achieve good results in the relevant literature. Indeed, the default values provided in software tools are based precisely on such an analysis, but it is not correct to describe these parameters as random; they are set within a specific range of values where good performance can be expected. Nonetheless, as stated before, these initial values cannot guarantee optimal performance in the general case, so researchers must usually struggle with a tedious trial-and-error process to tune system parameters. Therefore, the goal of the proposal made in this work is precisely to allow the designer to use “standard” initial values, which are then tuned automatically through the ES-based search. On the other hand, an
initial “good guess” for ρ cannot be derived from previous experience; it is therefore based on the range of values where performance peaked on some representative test cases. Hence, for the experiments reported below, a value within these quite large ranges is chosen for each problem. Indeed, Table 4 confirms what is commonly accepted as a desirable property of evolutionary algorithms: their robustness to parametrization within a wide range of values. Moreover, it is important to point out that ρ is not chosen based on the performance achieved on all problems, only on single representative instances.
6.3.1 Time series
Table 5 shows the comparisons on each of the time series problems, with statistics of the NRMSE (normalized root mean square error) computed over thirty runs, showing average performance and standard error. The results show that in all cases the ES algorithms obtained a substantial improvement over the baseline methods. For instance, Figure 2 shows the time series prediction for MKG2 using GDM and GDM-ES on the test data; this illustrates the improvement in performance achieved by the ES algorithm. To validate these results we perform a one-sided t-test between each BP algorithm and its ES variant (Zimmerman and Williams, 1986). For all of the cases there is a statistically significant improvement at the α = 0.01 significance level, except on the MKG2 time series with GDA and GDA-ES/L. This is strong evidence that the proposed algorithm is able to substantially improve upon standard BP training.
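The error measure can be stated concretely; one common convention, assumed here since the paper does not spell out the normalization, divides the RMSE by the standard deviation of the target series:

```python
import numpy as np

def nrmse(y_true, y_pred):
    """Normalized root mean square error: RMSE divided by the standard
    deviation of the target series (one common convention; the exact
    normalization used in the paper is not specified)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / np.std(y_true)
```

Under this definition a perfect predictor scores 0, while predicting the series mean everywhere scores 1.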
In comparison with other published results, the NRMSE for time series prediction on Mackey-Glass is quite good. For instance, for MKG2 the proposed algorithm outperforms a particle swarm algorithm as reported by Samanta (2011). However, the method does not outperform more complex prediction strategies that were specifically designed for time series forecasting, such as an ensemble of ANFIS models (Melin et al., 2012). Nonetheless, the goal here is only to improve the BP training process; domain-specific systems can then use the learning algorithm to design more complex ensemble or hybrid methods.
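The statistical comparison between two sets of independent runs can be sketched as follows; since the paper does not state the exact test variant, the snippet computes Welch's t statistic (an assumption), which would then be compared against the t distribution's critical value for a one-sided test:

```python
import math

def welch_t(sample_a, sample_b):
    """Welch's t statistic and degrees of freedom for two independent
    samples, e.g. thirty NRMSE values from a BP variant and from its
    ES counterpart. For a one-sided test, compare t against the
    critical value at the chosen significance level."""
    na, nb = len(sample_a), len(sample_b)
    ma, mb = sum(sample_a) / na, sum(sample_b) / nb
    va = sum((x - ma) ** 2 for x in sample_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in sample_b) / (nb - 1)
    se2 = va / na + vb / nb                    # squared standard error
    t = (ma - mb) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df
```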
6.3.2 Classification

For the classification problems we use k-fold cross-validation; Table 6 shows the comparisons on each dataset with statistics for classification accuracy, showing average performance and standard error. The results show that in the forensic glass problem the ES algorithms obtained a substantial improvement over the baseline methods. This is also apparent for the breast tissue classification problem, particularly for the GDA-ES variants. However, for red wine most of the ES algorithms only achieve a minimal improvement over the basic BP algorithms.

The statistical tests confirm these empirical observations: according to the t-tests, for the forensic glass problem all of the ES variants achieve a statistically significant improvement at the α = 0.01 significance level. For red wine quality, only the two memetic variants of GDA-ES showed a statistically significant improvement over GDA at the α = 0.05 significance level, with GDA-ES/L also achieving it at α = 0.01. In all other cases the performance is very similar, with all methods exhibiting a poor performance of around 50% accuracy. For the breast tissue problem, given the amount of available data and classes, only 3-fold cross-validation was performed, which precludes strong statistical testing and explains the large performance variations exhibited by two methods (GDA-ES/B and GDM-ES). Nonetheless, the results indicate a noticeable improvement in average performance, especially considering the observed standard error for the GDA-ES and GD-ES variants.

Finally, to provide an additional comparison, we compare with the accuracy results published by other authors on these sets. This is an informal comparison, since the algorithms were not executed under the same conditions. Nonetheless, it provides a rough estimate of the quality of the results reported here. The performance achieved by GDM-ES, GDX-ES/B and GD-ES on forensic glass compares favorably with other published results, see for instance (Denoeux, 2000; Zhong and Fukushima, 2007). For red wine the improved ES variants outperform low-tolerance regression methods, such as Support Vector Machines (Cortez et al., 2009). Moreover, GDA-ES/L achieves comparable results to those of Naive Bayes and Artificial Immune Recognition Systems, but is slightly behind Logistic Regression and another MLP that achieves around 60% accuracy according to Dogan and Tanrikulu (2010); these authors do not provide the network architecture, so a full comparison is not possible.

6.3.3 Biometric recognition

Table 7 Comparative results for the face recognition problem.

Method    | ρ   | Recognition % | std
GD        | –   | 3.45  | 1.1654
GD-ES     | 3   | 67    | 2.8087
GDA       | –   | 81.15 | 2.6357
GDA-ES/B  | 400 | 78.75 | 3.9033
GDA-ES/L  | 400 | 80.1  | 3.5103
GDM       | –   | 3     | 0.9128
GDM-ES    | 2   | 65.8  | 4.2242
GDX       | –   | 79    | 2.6457
GDX-ES/B  | 300 | 79.65 | 4.6430
GDX-ES/L  | 300 | 79.45 | 3.1837

Finally, we present comparative tests for biometric recognition on the ORL dataset; these are summarized in Table 7. In these tests our proposal achieves different results for each method. For instance, for GD and GDM the improvements are substantial, a difference
of one order of magnitude in performance, and the t-tests confirm this at the α = 0.01 significance level. On the other hand, for GDA and GDX the performance is basically the same with and without the ES, and the t-tests confirm it. It appears that for this problem the learning process must be able to change the learning rate at almost every epoch, hence the good performance of GDA and GDX, which is comparable with other approaches (Ayinde and Yang, 2002). Moreover, notice that GD and GDM have an extremely low accuracy, while their ES variants require a small value for ρ and thus a frequent update of the learning rate, see Table 4. However, the random modifications offered by the ES algorithm cannot achieve the same performance as the deterministic heuristic followed by GDA, as can be seen when we compare GD with GD-ES and GD with GDA.
7 Discussion and Conclusions

This paper presents a hybrid approach to neural network training that combines the standard back-propagation algorithm with a (1+1) Evolutionary Strategy that adaptively modifies the main learning parameters. The algorithm can easily accommodate any of the standard BP variants, and can thus perform a memetic search when coupled with an adaptive learning rate method, such as GDA or GDX. The proposed algorithm was tested on various benchmark problems and compared with standard methods. The results show that the overall performance of an MLP is in many cases increased with the hybrid learning approach, particularly for time series prediction and classification problems. In fact, even in the worst test cases, the performance of the proposed algorithm was never substantially worse than that of the standard methods. We do not claim that the proposed algorithm will achieve the best performance on all possible tests; indeed, the many degrees of freedom these algorithms present, coupled with well-known theoretical (Wolpert and Macready, 1997) and experimental (Cantú-Paz and Kamath, 2005) results, should lead us to the conclusion that the proposal can outperform other algorithms on some problems, but will probably perform worse in other cases. However, the results presented here are encouraging, considering that the ES variants mostly outperform simple BP algorithms, while Cantú-Paz and Kamath (2005) showed that in their experiments the standard BP algorithms actually performed better in many test cases. It appears that the unconstrained adaptation of the BP learning parameters provided by the (1+1)-ES does indeed allow the learning process to escape local optima and find a better overall optimization of the network connection weights.

Future work derived from this paper should first focus on further validating the algorithm in other application domains. Moreover, in principle the proposed strategy could be used to enhance other learning algorithms, even those based on EC techniques, thus providing a self-tuning or self-adapting mechanism, which is a prerequisite for the development of truly autonomous learning systems.

Acknowledgements

The first author was supported by scholarship 263888 from Consejo Nacional de Ciencia y Tecnología (CONACYT) of México. The corresponding author also thanks the Departamento de Ingeniería Eléctrica y Electrónica at the Instituto Tecnológico de Tijuana. Additionally, partial funding for this work was given by CONACYT (Mexico) Basic Science Research Grant No. 178323.

References
Alba, E., Chicano, J. F., 2004. Training neural networks
with GA hybrid algorithms. In: Deb, K. et al. (Eds.),
GECCO (1). Vol. 3102 of Lecture Notes in Computer
Science. Springer, pp. 852–863.
Ayinde, O., Yang, Y., 2002. Face recognition approach
based on rank correlation of gabor-filtered images.
Pattern Recognition 36 (6), 1275–1289.
Cantú-Paz, E., Kamath, C., 2005. An empirical comparison of combinations of evolutionary algorithms and neural networks for classification problems. IEEE Transactions on Systems, Man, and Cybernetics, Part B 35 (5), 915–927.
Castillo, O., Melin, P., 2002. Hybrid intelligent systems
for time series prediction using neural networks,
fuzzy logic, and fractal theory. IEEE Transactions on Neural Networks 13 (6), 1395–1408.
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., Reis,
J., 2009. Modeling wine preferences by data mining from physicochemical properties. Decis. Support Syst. 47 (4), 547–553.
DeJong, K., 2002. Evolutionary Computation: A Unified Approach. The MIT Press.
12
Denoeux, T., 2000. A neural network classifier based on Dempster-Shafer theory. IEEE Transactions on Systems, Man, and Cybernetics, Part A 30 (2), 131–150.
Dogan, N., Tanrikulu, Z., 2010. A comparative framework for evaluating classification algorithms. In:
Proceedings of the World Congress on Engineering
2010. Vol. 1. WCE, pp. 379–384.
Dorigo, M., Stützle, T., 2004. Ant Colony Optimization.
Bradford Company, Scituate, MA, USA.
Eiben, A. E., Smith, J. E., 2003. Introduction to Evolutionary Computing. Springer-Verlag.
Fogel, D. B., Fogel, L. J., Porto, V. W., 1990. Evolving
neural networks. Biol. Cybern. 63 (6), 487–493.
Hagan, M. T., Demuth, H. B., Beale, M., 1996. Neural Network Design. PWS Publishing Company, Boston, MA, USA.
Harp, S. A., Samad, T., Guha, A., 1989. Towards the
genetic synthesis of neural network. In: Proceedings
of the third international conference on Genetic algorithms. Morgan Kaufmann Publishers Inc., San
Francisco, CA, USA, pp. 360–369.
Husbands, P., 1994. Distributed coevolutionary genetic
algorithms for multi-criteria and multi-constraint
optimisation. In: Selected Papers from AISB Workshop on Evolutionary Computing. Springer-Verlag,
London, UK, pp. 150–165.
Isasi Viñuela, P., 2004. Redes Neuronales Artificiales: Un Enfoque Práctico. Pearson Educación.
Kim, H. B., Jung, S. H., Kim, T. G., Park, K. H., 1996.
Fast learning method for back-propagation neural network by evolutionary adaptation of learning
rates. Neurocomputing 11 (1), 101 – 106.
Kiranyaz, S., Ince, T., Yildirim, A., Gabbouj, M., 2009.
Evolutionary artificial neural networks by multidimensional particle swarm optimization. Neural
Netw. 22 (10), 1448–1462.
Kirkpatrick, S., Gelatt, C. D., Vecchi, M. P., 1983. Optimization by simulated annealing. Science 220 (4598),
671–680.
Lee, S.-W., 1996. Off-line recognition of totally unconstrained handwritten numerals using multilayer
cluster neural network. IEEE Trans. Pattern Anal.
Mach. Intell. 18 (6), 648–652.
Mackey, M., Glass, L., 1977. Oscillation and chaos in
physiological control systems. Science 197 (4300),
287–289.
Melin, P., Castillo, O., 2005. Hybrid Intelligent Systems
for Pattern Recognition Using Soft Computing: An
Evolutionary Approach for Neural Networks and
Fuzzy Systems (Studies in Fuzziness and Soft Computing). Springer-Verlag New York, Inc., Secaucus,
NJ, USA.
Melin, P., Soto, J., Castillo, O., Soria, J., 2012. A new
approach for time series prediction using ensembles
of anfis models. Expert Syst. Appl. 39 (3), 3494–3506.
Merelo, J. J., Patón, M., Cañas, A., Prieto, A., Morán, F., 1993. Optimization of a competitive learning neural network by genetic algorithms. In: Proceedings of the Int. Workshop on Artificial Neural Networks (IWANN93). Vol. 686 of Lecture Notes in Computer Science. Springer, Berlin, Germany, pp. 185–192.
Miller, G. F., Todd, P. M., Hegde, S. U., 1989. Designing
neural networks using genetic algorithms. In: Proceedings of the third international conference on Genetic algorithms. Morgan Kaufmann Publishers Inc.,
San Francisco, CA, USA, pp. 379–384.
Nolfi, S., Floreano, D., 2000. Evolutionary Robotics: The Biology, Intelligence, and Technology. MIT Press,
Cambridge, MA, USA.
Patel, D., 1996. Using genetic algorithms to construct a
network for financial prediction. In: Proceedings of
SPIE: Applications of Artificial Neural Networks in
Image Processing. pp. 204–213.
Hoel, P. G., Port, S. C., Stone, C. J., 1987. Introduction to Stochastic Processes. Waveland Press.
Radi, A., Poli, R., 2003. Discovering efficient learning
rules for feedforward neural networks using genetic
programming. Physica-Verlag GmbH, Heidelberg, Germany, Ch. 7, pp. 133–159.
Refaeilzadeh, P., Tang, L., Liu, H., 2009. Cross-validation. In: Liu, L., Özsu, M. T. (Eds.), Encyclopedia of Database Systems. Springer US, pp. 532–538.
Riedmiller, M., 1994. Rprop - description and implementation details. Tech. rep., University of Karlsruhe.
Rumelhart, D. E., Hinton, G. E., Williams, R. J., 1986.
Learning internal representations by error propagation. MIT Press, Cambridge, MA, USA, Ch. 8, pp.
318–362.
Samanta, B., 2011. Prediction of chaotic time series using computational intelligence. Expert Syst. Appl.
38 (9), 11406–11411.
Samarasinghe, S., 2006. Neural Networks for Applied
Sciences and Engineering. Auerbach Publications,
Boston, MA, USA.
Samaria, F. S., Harter, A. C., 2002. Parameterisation
of a stochastic model for human face identification.
In: Proceedings of the Second IEEE Workshop on Applications of Computer Vision. Sarasota, FL, pp. 138–142.
Schwefel, H.-P., 1981. Numerical Optimization of
Computer Models. John Wiley & Sons, Inc., New
York, NY, USA.
13
Stanley, K. O., D’Ambrosio, D. B., Gauci, J., 2009. A
hypercube-based encoding for evolving large-scale
neural networks. Artif. Life 15 (2), 185–212.
Stanley, K. O., Miikkulainen, R., 2002. Evolving neural networks through augmenting topologies. Evol.
Comput. 10 (2), 99–127.
Wolpert, D. H., Macready, W. G., 1997. No free lunch theorems for optimization. IEEE Trans. Evol. Comput. 1 (1), 67–82.
Yao, X., 1999. Evolving artificial neural networks. Proceedings of the IEEE 87 (9), 1423–1447.
Zhang, G., 2000. Neural networks for classification: a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C 30 (4), 451–462.
Zhong, P., Fukushima, M., 2007. Regularized nonsmooth newton method for multi-class support
vector machines. Optimization Methods Software
22 (1), 225–236.
Zimmerman, D. W., Williams, R. H., 1986. Modern
elementary statistics, with theoretical supplement
and BASIC programming. American Sciences Press,
Syracuse, NY, USA.