
BOA: The Bayesian Optimization Algorithm

Martin Pelikan, David E. Goldberg, and Erick Cantú-Paz


Illinois Genetic Algorithms Laboratory
Department of General Engineering
University of Illinois at Urbana-Champaign
{pelikan,deg,cantupaz}@illigal.ge.uiuc.edu
Abstract

In this paper, an algorithm based on the concepts of genetic algorithms that uses an estimation of a probability distribution of promising solutions in order to generate new candidate solutions is proposed. To estimate the distribution, techniques for modeling multivariate data by Bayesian networks are used. The proposed algorithm identifies, reproduces and mixes building blocks up to a specified order. It is independent of the ordering of the variables in the strings representing the solutions. Moreover, prior information about the problem can be incorporated into the algorithm. However, prior information is not essential. Preliminary experiments show that the BOA outperforms the simple genetic algorithm even on decomposable functions with tight building blocks as a problem size grows.

1 INTRODUCTION

Recently, there has been a growing interest in optimization methods that explicitly model the good solutions found so far and use the constructed model to guide the further search (Baluja, 1994; Harik et al., 1997; Mühlenbein & Paaß, 1996; Mühlenbein et al., 1998; Pelikan & Mühlenbein, 1999). This line of research in stochastic optimization was strongly motivated by results achieved in the field of evolutionary computation. However, the connection between these two areas has sometimes been obscured. Moreover, the capabilities of model building have often been insufficiently powerful to solve hard optimization problems.

The purpose of this paper is to introduce an algorithm that uses techniques for estimating the joint distribution of multinomial data by Bayesian networks in order to generate new solutions. The proposed algorithm extends existing methods in order to solve more difficult classes of problems more efficiently and reliably. By covering interactions of higher order, the disruption of identified partial solutions is prevented. Prior information from various sources can be used. The combination of information from the set of good solutions and the prior information about a problem is used to estimate the distribution. Preliminary experiments with uniformly-scaled additively decomposable problems with non-overlapping building blocks indicate that the proposed algorithm is able to solve all tested problems in close to linear time with respect to the number of fitness evaluations until convergence.

In Section 2, the background needed to understand the motivation and basic principles of the discussed methods is provided. In Section 3, the Bayesian optimization algorithm (BOA) is introduced. In subsequent sections, the structure of Bayesian networks and the techniques used in the BOA to construct the network for a given data set and to use the constructed network to generate new instances are described. The results of the experiments are presented in Section 6. The conclusions are provided in Section 7.

2 BACKGROUND

Genetic algorithms (GAs) are optimization methods loosely based on the mechanics of artificial selection and genetic recombination operators. Most of the theory of genetic algorithms deals with the so-called building blocks (BBs) (Goldberg, 1989). By building blocks, partial solutions of a problem are meant. The genetic algorithm implicitly manipulates a large number of building blocks by mechanisms of selection and recombination. It reproduces and mixes building blocks.
However, a fixed mapping from the space of solutions into the internal representation of solutions in the algorithm, together with simple two-parent recombination operators, soon showed to be insufficiently powerful even for problems that are composed of simpler partial subproblems. General, fixed, problem-independent recombination operators often break partial solutions, which can sometimes lead to losing these and converging to a local optimum. Two crucial factors of GA success, a proper growth and mixing of good building blocks, are often not achieved (Thierens, 1995). Various attempts to prevent the disruption of important building blocks have been made recently and are briefly discussed in the remainder of this section.

There are two major approaches to resolve the problem of building-block disruption. The first approach is based on manipulating the representation of solutions in the algorithm in order to make the interacting components of partial solutions less likely to be broken by recombination operators. Various reordering and mapping operators were used. However, reordering operators are often too slow and lose the race against selection, resulting in premature convergence to low-quality solutions. Reordering is not sufficiently powerful to ensure a proper mixing of partial solutions before these are lost. This line of research has resulted in algorithms which evolve the representation of a problem among individual solutions, e.g. the messy genetic algorithm (mGA) (Goldberg et al., 1989), the gene expression messy genetic algorithm (GEMGA) (Kargupta, 1998), the linkage learning genetic algorithm (LLGA) (Harik & Goldberg, 1996), or the linkage identification by non-linearity checking genetic algorithm (LINC-GA) (Munetomo & Goldberg, 1998).

A different way to cope with the disruption of partial solutions is to change the basic principle of recombination. In the second approach, instead of implicit reproduction of important building blocks and their mixing by selection and two-parent recombination operators, new solutions are generated by using the information extracted from the entire set of promising solutions. Global information about the set of promising solutions can be used to estimate their distribution, and new candidate solutions can be generated according to this estimate. A general scheme of the algorithms based on this principle is called the estimation of distribution algorithm (EDA) (Mühlenbein & Paaß, 1996).

In EDAs, better solutions are selected from an initially randomly generated population of solutions, as in the simple GA. The distribution of the selected set of solutions is estimated. New solutions are generated according to this estimate. The new solutions are then added into the original population, replacing some of the old ones. The process is repeated until the termination criteria are met. However, estimating the distribution is not an easy task. There is a trade-off between the accuracy of the estimation and its computational cost.
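This general scheme is compact enough to state as code. The following is a minimal illustrative sketch (not part of the original text); it assumes binary strings, truncation selection of the better half, and an even population size, and it leaves the two model-specific steps as the parameters estimate and sample.

import random

def eda(fitness, n, pop_size, estimate, sample, generations):
    # Generic EDA loop (illustrative): select promising solutions, estimate
    # their distribution, and sample new candidate solutions from the
    # estimate. pop_size is assumed even; strings are 0/1 tuples.
    population = [tuple(random.randint(0, 1) for _ in range(n))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: keep the better half of the population.
        ranked = sorted(population, key=fitness, reverse=True)
        promising = ranked[:pop_size // 2]
        # Estimate the distribution of the selected set ...
        model = estimate(promising, n)
        # ... and replace the worse half by solutions sampled from it.
        offspring = [sample(model) for _ in range(pop_size // 2)]
        population = promising + offspring
    return max(population, key=fitness)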
The simplest way to estimate the distribution of good solutions is to consider each variable in a problem independently and to generate new solutions by only preserving the proportions of the values of all variables independently of the remaining solutions. This is the basic principle of the population based incremental learning (PBIL) algorithm (Baluja, 1994), the compact genetic algorithm (cGA) (Harik et al., 1997), and the univariate marginal distribution algorithm (UMDA) (Mühlenbein & Paaß, 1996). There is theoretical evidence that the UMDA approximates the behavior of the simple GA with uniform crossover (Mühlenbein, 1997). It reproduces and mixes the building blocks of order one very efficiently. The theory of UMDA, based on the techniques of quantitative genetics, can be found in Mühlenbein (1997). Some analyses of PBIL can be found in Kvasnicka et al. (1996).
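A univariate model of this kind reduces to one frequency per string position. The following sketch in the spirit of the UMDA (illustrative, not taken from the cited papers) plugs directly into the eda loop above.

import random

def estimate_univariate(promising, n):
    # One marginal per position: the proportion of ones among the selected
    # strings, ignoring all interactions between variables.
    count = len(promising)
    return [sum(s[i] for s in promising) / count for i in range(n)]

def sample_univariate(probs):
    # Sample each position independently according to its marginal.
    return tuple(1 if random.random() < p else 0 for p in probs)

# Example (onemax as a toy fitness function):
# best = eda(sum, n=50, pop_size=200, estimate=estimate_univariate,
#            sample=sample_univariate, generations=30)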
The PBIL, cGA, and UMDA algorithms work very well for problems with no significant interactions among variables (Mühlenbein, 1997; Harik et al., 1997; Pelikan & Mühlenbein, 1999). However, partial solutions of order more than one are disrupted, and therefore these algorithms experience great difficulty in solving problems with interactions among the variables. First attempts to solve this problem were based on covering some pairwise interactions, e.g. the incremental algorithm using so-called dependency trees as a distribution estimate (Baluja & Davies, 1997), the population-based MIMIC algorithm using simple chain distributions (De Bonet et al., 1997), or the bivariate marginal distribution algorithm (BMDA) (Pelikan & Mühlenbein, 1999). In the algorithms based on covering pairwise interactions, the reproduction of building blocks of order one is guaranteed. Moreover, the disruption of some important building blocks of order two is prevented. Important building blocks of order two are identified using various statistical methods. Mixing of building blocks of order one and two is guaranteed assuming the independence of the remaining groups of variables.

However, covering only pairwise interactions has been shown to be insufficient for solving problems with interactions of higher order efficiently (Pelikan & Mühlenbein, 1999). Covering pairwise interactions still does not preserve higher order partial solutions. Moreover, interactions of higher order do not necessarily imply pairwise interactions that can be detected at the level of partial solutions of order two.
In the factorized distribution algorithm (FDA) (Mühlenbein et al., 1998), a factorization of the distribution is used for generating new solutions. The distribution factorization is a conditional distribution constructed by analyzing the problem decomposition. The FDA is capable of covering the interactions of higher order and combining important partial solutions effectively. It works very well on additively decomposable problems. The theory of UMDA can be used in order to estimate the time to convergence in the FDA.

However, the FDA requires prior information about the problem in the form of a problem decomposition and its factorization. As an input, this algorithm gets complete or approximate information about the structure of a problem. Unfortunately, the exact distribution factorization is often not available without computationally expensive problem analysis. Moreover, the use of an approximate distribution according to the current state of information represented by the set of promising solutions can be very effective even if it is not a valid distribution factorization. However, by providing sufficient conditions for the distribution estimate that ensure a fast and reliable convergence on decomposable problems, the FDA is of great theoretical value. Moreover, for problems of which the factorization of the distribution is known, the FDA is a very powerful optimization tool.

The algorithm proposed in this paper is also capable of covering higher order interactions. It uses techniques from the field of modeling data by Bayesian networks in order to estimate the joint distribution of promising solutions. The class of distributions that are considered is identical to the class of conditional distributions used in the FDA. Therefore, the theory of the FDA can be used in order to demonstrate the power of the proposed algorithm to solve decomposable problems. However, unlike the FDA, our algorithm does not require any prior information about the problem. It discovers the structure of a problem on the fly. It identifies, reproduces and mixes building blocks up to a specified order very efficiently.

In this paper, the solutions will be represented by binary strings of fixed length. However, the described techniques can be easily extended for strings over any finite alphabet. String positions will be numbered sequentially from left to right, starting with the position 0.

3 BAYESIAN OPTIMIZATION ALGORITHM

This section introduces an algorithm that uses techniques for modeling data by Bayesian networks to estimate the joint distribution of promising solutions (strings). This estimate is used to generate new candidate solutions. The proposed algorithm is called the Bayesian optimization algorithm (BOA). The BOA covers both the UMDA and the BMDA and extends them to cover the interactions of higher order. The order of interactions that will be taken into account can be given as input to the algorithm. The combination of prior information and the set of promising solutions is used to estimate the distribution. Prior information about the structure of a problem as well as the information represented by the set of high-quality solutions can be incorporated into the algorithm. The ratio between the prior information and the information acquired during the run that is used to generate new solutions can be controlled. The BOA fills the gap between the fully informed FDA and totally uninformed black-box optimization methods. Prior information is not essential.

In the BOA, the first population of strings is generated at random. From the current population, the better strings are selected. Any selection method can be used. A Bayesian network that fits the selected set of strings is constructed. Any metric as a measure of the quality of networks and any search algorithm can be used to search over the networks in order to maximize the value of the used metric. New strings are generated using the joint distribution encoded by the constructed network. The new strings are added into the old population, replacing some of the old ones. The pseudo-code of the BOA follows:

The Bayesian Optimization Algorithm (BOA)

(1) set t := 0, randomly generate the initial population P(0)
(2) select a set of promising strings S(t) from P(t)
(3) construct the network B using a chosen metric and constraints
(4) generate a set of new strings O(t) according to the joint distribution encoded by B
(5) create a new population P(t+1) by replacing some strings from P(t) with O(t), set t := t + 1
(6) if the termination criteria are not met, go to (2)
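In terms of the generic eda loop sketched in Section 2, the BOA instantiates only steps (3) and (4). The wiring below is an illustrative sketch of ours, not code from the paper; greedy_network, estimate_cond_probs and sample_instance refer to the routines sketched in Section 4, and k is the bound on incoming edges discussed there.

def estimate_boa(promising, n, k=2):
    parents = greedy_network(promising, n, k)                # step (3)
    return parents, estimate_cond_probs(promising, parents, n)

def sample_boa(model):
    parents, cond = model
    return sample_instance(parents, cond)                    # step (4)

# best = eda(fitness, n, pop_size, estimate_boa, sample_boa, generations)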
In the following section, Bayesian networks and the techniques for their construction and use will be described.

4 BAYESIAN NETWORKS

Bayesian networks (Howard & Matheson, 1981; Pearl, 1988) are often used for modeling multinomial data with both discrete and continuous variables. A Bayesian network encodes the relationships between the variables contained in the modeled data. It represents the structure of a problem. Bayesian networks can be used to describe the data as well as to generate new instances of the variables with similar properties as those of the given data. Each node in the network corresponds to one variable. By X_i, both the variable and the node corresponding to this variable will be denoted in this text. Each variable corresponds to one position in the strings representing the solutions (X_i corresponds to the ith position in a string). The relationship between two variables is represented by an edge between the two corresponding nodes. The edges in Bayesian networks can be either directed or undirected. In this paper, only Bayesian networks represented by directed acyclic graphs will be considered. The modeled data sets will be defined within discrete domains.

Mathematically, an acyclic Bayesian network with directed edges encodes a joint probability distribution. This can be written as

    p(X) = \prod_{i=0}^{n-1} p(X_i \mid \Pi_{X_i}),    (1)

where X = (X_0, ..., X_{n-1}) is a vector of variables, \Pi_{X_i} is the set of parents of X_i in the network (the set of nodes from which there exists an edge to X_i), and p(X_i \mid \Pi_{X_i}) is the conditional probability of X_i conditioned on the variables \Pi_{X_i}. This distribution can be used to generate new instances using the marginal and conditional probabilities in a modeled data set.
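As a small worked example (ours, for illustration): in a network over three variables where X_0 has no parents and is the only parent of both X_1 and X_2, Equation 1 factorizes as

    p(X_0, X_1, X_2) = p(X_0) \, p(X_1 \mid X_0) \, p(X_2 \mid X_0),

so the probability of, say, the string 110 is p(X_0 = 1) p(X_1 = 1 \mid X_0 = 1) p(X_2 = 0 \mid X_0 = 1), each factor being a marginal or conditional frequency read off the modeled data set.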
The following sections discuss how to learn the network structure if this is not given by the user, and how to use the network to generate new candidate solutions.

4.1 CONSTRUCTING THE NETWORK

There are two basic components of the algorithms for learning the network structure (Heckerman et al., 1994). The first one is a scoring metric and the second one is a search procedure. A scoring metric is a measure of how well the network models the data. Prior knowledge about the problem can be incorporated into the metric as well. A search procedure is used to explore the space of all possible networks in order to find the one (or a set of networks) with the value of a scoring metric as high as possible. The space of networks can be reduced by constraint operators. Commonly used constraints restrict the networks to have at most k incoming edges into each node. This number directly influences the complexity of both the network construction as well as its use for the generation of new instances, and the order of interactions that can be covered by the class of networks restricted in this way.

4.1.1 Bayesian Dirichlet metric

As a measure of the quality of networks, the so-called Bayesian Dirichlet (BD) metric (Heckerman et al., 1994) can be used. This metric combines the prior knowledge about the problem and the statistical data from a given data set. The BD metric for a network B given a data set D of size N, and the background information \xi, denoted by p(D, B \mid \xi), is defined as

    p(D, B \mid \xi) = p(B \mid \xi) \prod_{i=0}^{n-1} \prod_{\pi_{X_i}} \frac{m'(\pi_{X_i})!}{(m'(\pi_{X_i}) + m(\pi_{X_i}))!} \prod_{x_i} \frac{(m'(x_i, \pi_{X_i}) + m(x_i, \pi_{X_i}))!}{m'(x_i, \pi_{X_i})!},

where p(B \mid \xi) is the prior probability of the network B, the product over \pi_{X_i} runs over all instances of the parents \Pi_{X_i} of X_i, and the product over x_i runs over all instances of X_i. By m(\pi_{X_i}), the number of instances in D with the variables \Pi_{X_i} (the parents of X_i) instantiated to \pi_{X_i} is denoted. When the set \Pi_{X_i} is empty, there is one instance of \pi_{X_i} and the number of instances with \Pi_{X_i} instantiated to this instance is set to N. By m(x_i, \pi_{X_i}), we denote the number of instances in D that have both X_i set to x_i as well as \Pi_{X_i} set to \pi_{X_i}. By the numbers m'(x_i, \pi_{X_i}) and p(B \mid \xi), prior information about the problem is incorporated into the metric. The m'(x_i, \pi_{X_i}) stands for prior information about the number of instances that have X_i set to x_i and the set of variables \Pi_{X_i} instantiated to \pi_{X_i}. The prior probability p(B \mid \xi) of the network reflects how the measured network resembles the prior network. By using a prior network, the prior information about the structure of a problem is incorporated into the metric. The prior network can be set to an empty network when there is no such information. In our implementation, we set p(B \mid \xi) = 1 for all networks, i.e. all networks are treated equally.

The numbers m'(x_i, \pi_{X_i}) can be set in various ways. They can be set according to the prior information the user has about the problem. When there is no prior
information, uninformative assignments can be used. In the so-called K2 metric, for instance, the m'(x_i, \pi_{X_i}) coefficients are all simply set to 1 (Heckerman et al., 1994). This assignment corresponds to having no prior information about the problem. In the empirical part of this paper we will use the K2 metric.
Since the factorials in the BD metric can grow to huge numbers, usually a logarithm of the scoring metric is used. The contribution of one node to the logarithm of the metric can be computed in O(2^k N) steps, where k is the maximal number of incoming edges into each node in the network and N is the size of the data set (the number of instances). The increase of the logarithm of the value of the BD metric for an edge addition, edge reversal, or an edge removal, respectively, can also be computed in time O(2^k N), since the total contribution corresponding to the nodes of which the set of parents has not changed remains unchanged as well. Assuming that k is constant, we get linear time complexity O(N) with respect to the size of the data set for the computation of both the contribution of one node and the increase of the metric for an edge addition.
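A minimal sketch of this computation for binary strings follows (illustrative, not from the original text). It uses the K2 assignment m'(x_i, \pi_{X_i}) = 1, takes m'(\pi_{X_i}) = 2 on the reading that the two prior counts are summed (the paper does not spell this out), and works with log-factorials via math.lgamma to avoid the huge numbers mentioned above. The empty parent set is handled naturally: the single empty instantiation then collects all N instances.

import math
from collections import Counter
from itertools import product

def log_fact(n):
    return math.lgamma(n + 1)  # log n!

def log_k2_contribution(data, i, parents):
    # Contribution of node X_i to the logarithm of the BD metric under the
    # K2 prior (all m'(x_i, pi) = 1); data is a list of 0/1 tuples.
    m_parent = Counter(tuple(s[j] for j in parents) for s in data)
    m_joint = Counter((s[i], tuple(s[j] for j in parents)) for s in data)
    total = 0.0
    for pi in product((0, 1), repeat=len(parents)):
        # log [ m'(pi)! / (m'(pi) + m(pi))! ] with m'(pi) = 2
        total += log_fact(2) - log_fact(2 + m_parent[pi])
        for x in (0, 1):
            # log [ (m'(x, pi) + m(x, pi))! / m'(x, pi)! ] with m'(x, pi) = 1
            total += log_fact(1 + m_joint[(x, pi)]) - log_fact(1)
    return total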
4.1.2 Searching for a Good Network

In this section, the basic principles of algorithms that can be used for searching over the networks in order to maximize the value of a scoring metric are described. Only the classes of networks with a restricted number of incoming edges, denoted by k, will be considered.

a) k = 0
This case is trivial. An empty network is the best one (and the only one possible).

b) k = 1
For k = 1, there exists a polynomial algorithm for the network construction (Heckerman et al., 1994). The problem can be easily reduced to a special case of the so-called maximal branching problem, for which there exists a polynomial algorithm (Edmonds, 1967).

c) k > 1
For k > 1 the problem gets much more complicated. Although for k = 1 there exists a polynomial algorithm for finding the best network, for k > 1 the problem of determining the best network with respect to a given metric is NP-complete for most Bayesian and non-Bayesian metrics (Heckerman et al., 1994).

Various algorithms can be used in order to find a good network, from total enumeration to blind random search. Usually, due to their effectiveness in this context, simple local search based methods are used (Heckerman et al., 1994). A simple greedy algorithm, local hill-climbing, or simulated annealing can be used. Simple operations that can be performed on a network include edge additions, edge reversals, and edge removals. In each iteration, the operation that improves the network the most is applied. Only operations that keep the network acyclic and with at most k incoming edges into each of the nodes are allowed (i.e., the operations that do not violate the constraints). The algorithms can start with an empty network, the best network with at most one incoming edge into each node, or a randomly generated network.

In our implementation, we have used a simple greedy algorithm with only edge additions allowed. The algorithm starts with an empty network. The time complexity of this algorithm can be computed using the time complexity of a simple edge addition and the maximal number of edges that have to be processed. With the BD metric, the overall time to construct the network using the described greedy algorithm is O(k 2^k n^2 N + k n^3). Assuming that k is constant, we get the overall time complexity O(n^2 N + n^3).
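The edge-addition-only greedy search can then be sketched as follows (illustrative; it reuses log_k2_contribution from above, and the acyclicity test is a plain ancestor check, one of several reasonable choices). Because only the child's parent set changes, only its contribution has to be re-evaluated per candidate edge, as discussed above.

def creates_cycle(parents, par, child):
    # Adding par -> child closes a cycle iff child is an ancestor of par.
    stack, seen = [par], set()
    while stack:
        v = stack.pop()
        if v == child:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return False

def greedy_network(data, n, k):
    # Start from an empty network; repeatedly add the single edge that
    # improves the log metric the most, subject to the constraints.
    parents = {i: [] for i in range(n)}
    score = {i: log_k2_contribution(data, i, parents[i]) for i in range(n)}
    while True:
        best = None  # (gain, child, par)
        for child in range(n):
            if len(parents[child]) >= k:
                continue  # at most k incoming edges per node
            for par in range(n):
                if par == child or par in parents[child] \
                        or creates_cycle(parents, par, child):
                    continue
                gain = log_k2_contribution(
                    data, child, parents[child] + [par]) - score[child]
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, child, par)
        if best is None:
            return parents  # no addition improves the metric
        gain, child, par = best
        parents[child].append(par)
        score[child] += gain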
4.2 GENERATING NEW INSTANCES

In this section, the generation of new instances using a network B and the marginal frequencies for a few sets of variables in the modeled data set will be described. New instances are generated using the joint distribution encoded by the network (see Equation 1).

First, the conditional probabilities of each possible instance of each variable given all possible instances of its parents in the given data set are computed. The conditional probabilities are used to generate each new instance. In each iteration, the nodes whose parents are already fixed are generated using the corresponding conditional probabilities. This is repeated until the values of all variables are generated. Since the network is acyclic, it is easy to see that the algorithm is well defined.

The time complexity of generating an instance of all variables is bounded by O(kn), where n is the number of variables. Assuming that k is constant, the overall time complexity is O(n).
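Both steps can be sketched compactly (illustrative code with hypothetical names): first estimate the conditional frequencies from the selected strings, then generate a new string by repeatedly fixing every node whose parents are already fixed. A probability of 0.5 is assumed for parent configurations never observed in the data, which is one simple choice among several.

import random
from collections import Counter

def estimate_cond_probs(data, parents, n):
    # cond[i][pi] = relative frequency of X_i = 1 among instances whose
    # parents are instantiated to pi.
    cond = {}
    for i in range(n):
        joint = Counter((s[i], tuple(s[j] for j in parents[i])) for s in data)
        totals = Counter(tuple(s[j] for j in parents[i]) for s in data)
        cond[i] = {pi: joint[(1, pi)] / m for pi, m in totals.items()}
    return cond

def sample_instance(parents, cond):
    # Ancestral sampling: since the network is acyclic, some unfixed node
    # always has all of its parents fixed, so the loop terminates.
    n = len(parents)
    values = [None] * n
    while any(v is None for v in values):
        for i in range(n):
            if values[i] is None and all(values[j] is not None
                                         for j in parents[i]):
                pi = tuple(values[j] for j in parents[i])
                p_one = cond[i].get(pi, 0.5)  # unseen parent configuration
                values[i] = 1 if random.random() < p_one else 0
    return tuple(values)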
5 DECOMPOSABLE FUNCTIONS

A function is additively decomposable of a certain order if it can be written as the sum of simpler functions defined over subsets of variables, each of cardinality less than or equal to the order of the decomposition (Mühlenbein et al., 1998; Pelikan & Mühlenbein, 1999). The problems defined by this class of functions can be decomposed into smaller subproblems. However, simple GAs experience great difficulty in solving these decomposable problems with deceptive building blocks when these are not mapped tightly onto the strings representing the solutions (Thierens, 1995).

In general, the BOA with k ≥ 0 can cover interactions of order k + 1. This actually does not mean that all interactions in a problem that is order-(k + 1) decomposable can be covered (e.g., 2D spin-glass systems (Mühlenbein et al., 1998)). There is no straightforward way to relate general decomposable problems and the interactions that necessarily have to be taken into account (or, what the order of the building blocks is). By introducing overlapping among the sets from the decomposition, along with scaling of the contributions of each of these sets according to some function of problem size, the problem becomes very complex. Nevertheless, the class of distributions the BOA uses is very powerful for decomposable problems with either overlapping or non-overlapping building blocks of a bounded order. This has been confirmed by a number of experiments with various test functions.

6 EXPERIMENTS

The experiments were designed in order to show the behavior of the proposed algorithm only on non-overlapping decomposable problems with uniformly scaled deceptive building blocks. For all problems, the scalability of the proposed algorithm is investigated. In the following sections, the functions of unitation used in the experiments will be described and the results of the experiments will be presented.

6.1 FUNCTIONS OF UNITATION

A function of unitation is a function whose value depends only on the number of ones in a binary input string. The function values for strings with the same number of ones are equal. Several functions of unitation can be additively composed in order to form a more complex function. Let us have a function of unitation f_k defined for strings of length k. Then, the function additively composed of l functions f_k is defined as

    f(X) = \sum_{i=0}^{l-1} f_k(S_i),    (2)

where X is the set of n variables and the S_i for i \in \{0, ..., l-1\} are subsets of k variables from X. The sets S_i can be either overlapping or non-overlapping, and they can be mapped onto a string (the inner representation of a solution) so that the variables from one set are either mapped close to each other or spread throughout the whole string. Each variable will be required to contribute to the function through some of the subfunctions. A function composed in this fashion is clearly additively decomposable of the order of the subfunctions it was composed with.

A deceptive function of order 3, denoted by 3-deceptive, is defined as

    f_{3deceptive}(u) =
      \begin{cases}
        0.9 & \text{if } u = 0 \\
        0.8 & \text{if } u = 1 \\
        0   & \text{if } u = 2 \\
        1   & \text{otherwise,}
      \end{cases}    (3)

where u is the number of ones in an input string.

A trap function of order 5, denoted by trap-5, is defined as

    f_{trap5}(u) =
      \begin{cases}
        4 - u & \text{if } u < 5 \\
        5     & \text{otherwise.}
      \end{cases}    (4)

A bipolar deceptive function of order 6, denoted by 6-bipolar, is defined with the use of the 3-deceptive function as follows:

    f_{6bipolar}(u) = f_{3deceptive}(|3 - u|).    (5)
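These definitions translate directly into code. The sketch below (ours, for illustration) also composes l tightly mapped, non-overlapping copies of a block function into a full fitness function, following Equation 2.

def f_3deceptive(u):
    return (0.9, 0.8, 0.0)[u] if u < 3 else 1.0

def f_trap5(u):
    return 4 - u if u < 5 else 5

def f_6bipolar(u):
    return f_3deceptive(abs(3 - u))

def compose(f_k, k):
    # Additive composition over consecutive, non-overlapping k-bit blocks
    # (tight mapping); x is a 0/1 tuple whose length is a multiple of k.
    def f(x):
        return sum(f_k(sum(x[i:i + k])) for i in range(0, len(x), k))
    return f

# Example: a 30-bit trap-5 fitness, usable with the eda loop of Section 2.
fitness = compose(f_trap5, 5)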
6.2 RESULTS OF THE EXPERIMENTS

For all problems, the average number of fitness evaluations until convergence in 30 independent runs is shown. For the 3-deceptive and trap-5 functions, the population is said to have converged when the proportion of some value on each position reaches 95%. This criterion of convergence is applicable only for problems with at most one global optimum and selection schemes that do not force the algorithm to preserve the diversity in a population (e.g. by niching methods). For the 6-bipolar function, the population is said to have converged when it contains over a half of optimal solutions. For all algorithms, the population sizes for all problem instances have been determined empirically as the minimal size such that the algorithms converge to the optimum in all of the 30 independent runs. In all runs, truncation selection with τ = 50% has been used (the better half of the solutions is selected). Offspring replace the worse half of the old population. The crossover rate for the simple GA has been empirically determined for each problem with one problem instance. In the simple GA, the best results have been achieved with a probability of crossover of 100%. The probability of flipping a single bit by mutation has
been set to 1%. In the BOA, no prior information except the maximal order of interactions to be considered has been incorporated into the algorithm.

In Figure 1, the results for the 3-deceptive function are presented. In this function, the deceptive building blocks are of order 3. The building blocks are non-overlapping and mapped tightly onto the strings. Therefore, one-point crossover is not likely to disrupt them. The looser the building blocks would be, the worse the simple GA would perform. Since the building blocks are deceptive, the computational requirements of the simple GA with uniform crossover and the BOA with k = 0 (i.e., the UMDA) grow exponentially, and therefore we do not present the results for these algorithms. Some results for the BMDA can be found in Pelikan and Mühlenbein (1999). The BOA with k = 2 and the K2 metric performs the best of the compared algorithms in terms of the number of fitness evaluations until successful convergence. The simple GA with one-point crossover performs worse than the BOA with k = 2 as the problem size grows. For loose building blocks, the simple GA with one-point crossover would require a number of fitness evaluations growing exponentially with the size of a problem (Thierens, 1995). On the other hand, the BOA would perform the same, since it is independent of the variable ordering in a string. The population sizes for the GA ranged from N = 400 for n = 30 to N = 7700 for n = 180. The population sizes for the BOA ranged from N = 1000 for n = 30 to N = 7700 for n = 180.

[Figure 1: Results for the 3-deceptive function. Number of fitness evaluations versus size of the problem for the BOA (k = 2, K2 metric) and the GA (one-point).]

In Figure 2, the results for the trap-5 function are presented. The building blocks are non-overlapping and they are again mapped tightly onto a string. The results for this function are similar to those for the 3-deceptive function. The population sizes for the GA ranged from N = 600 for n = 30 to N = 8100 for n = 180. The population sizes for the BOA ranged from N = 1300 for n = 30 to N = 11800 for n = 180.

[Figure 2: Results for the trap-5 function. Number of fitness evaluations versus size of the problem for the BOA (k = 4, K2 metric) and the GA (one-point).]

In Figure 3, the results for the 6-bipolar function are presented. The results for this function are similar to those for the 3-deceptive function. In addition to the faster convergence, the BOA discovers a number of solutions out of the totally 2^{n/6} global optima of the 6-bipolar function instead of converging to a single solution. This effect could be further magnified by using niching methods. The population sizes for the GA ranged from N = 360 for n = 30 to N = 4800 for n = 180. The population sizes for the BOA ranged from N = 900 for n = 30 to N = 5000 for n = 180.

[Figure 3: Results for the 6-bipolar function. Number of fitness evaluations versus size of the problem for the BOA (k = 5, K2 metric) and the GA (one-point).]

7 CONCLUSIONS

The experiments have shown that the proposed algorithm outperforms the simple GA even on decomposable problems with tight building blocks as the problem size grows.
The gap between the proposed algorithm and the simple GA would significantly enlarge for problems with loose building blocks. For a loose mapping, the time requirements of the simple GA grow exponentially with the problem size. On the other hand, the BOA is independent of the ordering of the variables in a string, and therefore changing this would not affect the performance of the algorithm. The proposed algorithm also works very well for other problems with highly overlapping building blocks, e.g. spin-glasses, that are not discussed in this paper.

Acknowledgments

The authors would like to thank Heinz Mühlenbein, David Heckerman, and Ole J. Mengshoel for valuable discussions and useful comments. Martin Pelikan was supported by grants number 1/4209/97 and 1/5229/98 of the Scientific Grant Agency of the Slovak Republic.

The work was sponsored by the Air Force Office of Scientific Research, Air Force Materiel Command, USAF, under grant number F49620-97-1-0050. Research funding for this project was also provided by a grant from the U.S. Army Research Laboratory under the Federated Laboratory Program, Cooperative Agreement DAAL01-96-2-0003. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies and endorsements, either expressed or implied, of the Air Force Office of Scientific Research or the U.S. Government.

References

Baluja, S. (1994). Population-based incremental learning: A method for integrating genetic search based function optimization and competitive learning (Tech. Rep. No. CMU-CS-94-163). Pittsburgh, PA: Carnegie Mellon University.

Baluja, S., & Davies, S. (1997). Using optimal dependency-trees for combinatorial optimization: Learning the structure of the search space. In Proceedings of the 14th International Conference on Machine Learning (pp. 30-38). Morgan Kaufmann.

De Bonet, J. S., Isbell, C. L., & Viola, P. (1997). MIMIC: Finding optima by estimating probability densities. In Mozer, M. C., Jordan, M. I., & Petsche, T. (Eds.), Advances in Neural Information Processing Systems, Volume 9 (pp. 424). Cambridge, MA: The MIT Press.

Edmonds, J. (1967). Optimum branchings. J. Res. NBS, 71B, 233-240.

Goldberg, D. E. (1989). Genetic algorithms in search, optimization, and machine learning. Reading, MA: Addison-Wesley.

Goldberg, D. E., Korb, B., & Deb, K. (1989). Messy genetic algorithms: Motivation, analysis, and first results. Complex Systems, 3(5), 493-530.

Harik, G. R., & Goldberg, D. E. (1996). Learning linkage. Foundations of Genetic Algorithms 4, 247-262.

Harik, G. R., Lobo, F. G., & Goldberg, D. E. (1997). The compact genetic algorithm (IlliGAL Report No. 97006). Urbana, IL: University of Illinois at Urbana-Champaign.

Heckerman, D., Geiger, D., & Chickering, M. (1994). Learning Bayesian networks: The combination of knowledge and statistical data (Technical Report MSR-TR-94-09). Redmond, WA: Microsoft Research.

Howard, R. A., & Matheson, J. E. (1981). Influence diagrams. In Howard, R. A., & Matheson, J. E. (Eds.), Readings on the Principles and Applications of Decision Analysis, Volume II (pp. 721-762). Menlo Park, CA: Strategic Decisions Group.

Kargupta, H. (1998). Revisiting the GEMGA: Scalable evolutionary optimization through linkage learning. In Proceedings of the 1998 IEEE International Conference on Evolutionary Computation (pp. 603-608). IEEE Press.

Kvasnicka, V., Pelikan, M., & Pospichal, J. (1996). Hill climbing with learning (An abstraction of genetic algorithm). Neural Network World, 6, 773-796.

Mühlenbein, H. (1997). The equation for response to selection and its use for prediction. Evolutionary Computation, 5(3), 303-346.

Mühlenbein, H., Mahnig, T., & Rodriguez, A. O. (1998). Schemata, distributions and graphical models in evolutionary optimization. Submitted for publication.

Mühlenbein, H., & Paaß, G. (1996). From recombination of genes to the estimation of distributions I. Binary parameters. Parallel Problem Solving from Nature, PPSN IV, 178-187.

Munetomo, M., & Goldberg, D. E. (1998). Designing a genetic algorithm using the linkage identification by nonlinearity check (Technical Report 98014). Urbana, IL: University of Illinois at Urbana-Champaign.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann.

Pelikan, M., & Mühlenbein, H. (1999). The bivariate marginal distribution algorithm. In Roy, R., Furuhashi, T., & Chawdhry, P. K. (Eds.), Advances in Soft Computing - Engineering Design and Manufacturing (pp. 521-535). London: Springer-Verlag.

Thierens, D. (1995). Analysis and design of genetic algorithms. Leuven, Belgium: Katholieke Universiteit Leuven.
