Simple Representation Technique To Improve GA Performance
Steven L. Keast
Department of Computer Science and Software Engineering
Auburn University
Auburn, AL.
keastsl@auburn.edu
A new operator called Cloning is introduced that is a key contributor to the performance of
the GA. Cloning creates new individuals that have gene values identical to the hypotheses
from which they were cloned, but with dramatically different gene representations. This is
possible due to the technique used to encode the individual traits. By cloning hypotheses,
later crossover operations create offspring that can be very different from their parents, thus
increasing the speed at which the hypothesis space is searched.
Finally, this paper discusses the test results of the GA and characterizes its performance.
Areas covered include investigating the GA’s performance when Tour size, population size
and cloning rate are varied.
Introduction
One of the key problems faced by genetic algorithms is how to represent the phenotype as a
genotype – that is, to encode an individual in terms of specific gene values. Making a poor
decision on the phenotype representation has significant consequences. It can permit the GA
to get stuck on a sub-optimal solution. This can occur if the population converges on a less-
than-perfect hypothesis. If the population becomes the same non-ideal chromosome, further
progress of the algorithm will be difficult. Another potential issue is the creation of
chromosomes that have invalid gene values. This can occur during crossover or mutation. To
handle this problem the individual can simply be tossed, or it can be further modified to
create a valid individual. In either case, processor cycles are lost and the search time is
increased.
One way the representation problem has been addressed has been the simulation of biological processes.
Genotypes have been expressed using techniques such as variable length encoded strings, non-coded
segments (introns) and redundant genes. GAs using these techniques have shown improved performance in
searching the hypothesis space for the target solution.
Breeder Genetic Algorithms (BGAs) [6] take a different tack in solving the problem of
representation. They use floating point numbers to encode gene values, and add small
random floating point numbers to these numbers for the mutation operation. Researchers
such as Salomon have shown that BGAs can reduce the search effort from O(l q ln l ) for
simple binary encoding to O(l ln l) [6]. Salomon also describes in [6] the issue of the
Hamming distance between adjacent numbers and how this distance affects the probability of
This paper offers a simple new representation technique that solves the problems of the
creation of illegal offspring and of getting stuck on sub-optimal solutions. Gene values,
rather than being represented by Gray Codes, simple binary strings, or floating point numbers
are represented by 32-bit binary values. The paper also introduces a new operator, Cloning,
that replaces the traditional mutation operator used in GAs today. Cloning creates new
individuals that are identical to their clone/parent, yet with dramatically different 32-bit gene
representations.
Another problem is what to do with illegal hypotheses (those with an invalid allele value) that are
created by the crossover and mutation operators. This problem arises when a trait
has a total number of value options that is not a power of two (2^n). For
example, a trait could have a total of five possible values. This would require three bits to
encode, permitting a total of eight possible values. This means the mutation and crossover
Related Work
Although much of the work that has gone into the representation problem has dealt with
applying a specific solution to a unique problem, there is also much work to date that deals
with a more generalized study of representation and how it affects GA performance. This
work can be broken into two broad areas of study: GAs that simulate biological processes in
an effort to improve performance, and GAs that propose representation techniques that give
the GA an advantage in searching the hypothesis space efficiently.
Biological GAs
Biologically-based GAs break the representation problem down typically into the following
key components [1]:
♦ Genomes vary in length due to insertions, deletions and recombination.
♦ Genes on a genome can be position-independent.
♦ Genomes can contain non-coding sequences (Introns [3]).
♦ Genomes can have competing or duplicate genes.
♦ Genomes have overlapping reading frames.
The biological representation of the phenotype tries, through the simulation of known
biological processes, to solve problems that arise during GA operation and improve GA
performance. One area addressed by this technique is the use of Introns [1], non-coded
sequences within the genotype, to solve the problem of producing illegal offspring from the
crossover and mutation operators. By putting in a guard-block (the “Intron”) between valid
gene sequences, gene regions can be protected from these operations [1]. The addition of
Introns can have one other major benefit – GA operation can be improved by as much as a
factor of 10 [3].
Besides non-coded sequences, biological-based GAs also can vary the length and the position
of the gene within the genome [5]. Variable length and position representations require the
genome to be scanned to determine the phenotype value. The actual gene value within the
genome is delineated with START and STOP codes [5] and these codes can be found on
multiple reading frames. The crossover and mutation operators are much the same as in a
standard GA, but because gene size and position vary between the parents, the
crossover points must match up closely in both parents (same region on the genes) before
recombination is performed [5]. The mutation operator typically uses a fairly low probability,
0.001 in many cases, for best performance.
Biological GAs, by using representations that more closely follow natural processes, can
have improved performance over a more standard representation technique. The key to this
representation is that it allows for a “considerable degree of self-adaptation” [5]. The genome
with variable length and position genes, duplicate genes and non-coded sequences allows for
far more variability in future generations compared to other representation forms.
Designed-for-Performance GAs
Several different techniques are used to speed up GA performance outside the scope of
biological process simulation. One idea is the messy GA (mGA) [2] proposed in 1988 and
first published in 1989 by Goldberg, Korb and Deb [2]. The mGA represents genes as a pair
of numbers with the gene value defined as a binary or floating-point number and the gene
name included as one component of the pair. An example of an mGA gene could be (1 7), which
defines gene 1 with a value of 7. The chromosome of an mGA is a collection of gene values
that can include multiple definitions for the same gene, and certain gene types could even be
missing from the chromosome.
GA Problems
Salomon points out in [7] several issues involved with GA design. The main concern is the
revisiting of already sampled points in the search space due to the random application of
“variation operators” [7]. By revisiting these points, GA performance is reduced and less
predictable. The paper also looks at the Schema Theorem, Breeder GAs and probabilistic
estimates of GA performance, arguing that all misstate the computational complexity of a
GA.
Although [7] seems far from the representation techniques proposed in this paper, one result
of the new representation technique is to generate a great deal of variability in the offspring.
This could be seen as having the same effect as is seen in [7] by its author’s application of
the mutation operator in a deterministic fashion.
As Nicholas J. Radcliffe pointed out in [4], maximizing the length of the chromosome “will
give rise to the greatest degree of intrinsic parallelism” and “achieve maximum processing
efficiently” from the GA. The binary representation of gene values by itself can go a long
way in increasing the chromosome length. Whether coding of traits is in Gray Codes or a
simple binary representation, the chromosome length initially increases quickly as the
number of values for a gene increases.
The proposed representation technique tries to take advantage of the property that increased
chromosome length leads to increased GA efficiency. Gene values within the chromosome
are encoded as 32-bit integers, with every possible value of the integer being a legal encoding used to
determine a gene’s value. By using such a large number for each gene, chromosome length is
far larger than in most typical GAs. It should also be pointed out that 32-bit integers don’t
have to be the upper limit of a gene’s length for this representation. Other representations can
be done that exceed the 32-bit length suggested in this paper and further increases in
efficiency might be possible.
Genes within the chromosome are initially assigned values that are created by a 32-bit
random number generator. The actual gene value is calculated by taking the modulo of the
32-bit integer as follows:
gene value = (32-bit integer) mod TotalNumberOfGeneValues
If, for example, a gene represents eye color and the legal values for eye color are blue, green,
brown and hazel, the variable TotalNumberOfGeneValues will be set to 4. The total number
of ways any one gene value can be represented by the 32-bit integer is calculated by dividing
the number of distinct 32-bit integers by the variable TotalNumberOfGeneValues
(2^32 / TotalNumberOfGeneValues). Using such large numbers to represent each
gene greatly enlarges the search space.
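The modulo decoding described above can be sketched in C. This is a minimal illustration rather than the paper's implementation: the function names decode_gene and representations_per_value are hypothetical, and the gene is assumed to be stored as an unsigned 32-bit integer so that the remainder is never negative.

```c
#include <stdint.h>

/* Decode a gene: every 32-bit pattern maps to a legal value in
   [0, total_number_of_gene_values). The caller must pass a total >= 1. */
static uint32_t decode_gene(uint32_t raw, uint32_t total_number_of_gene_values)
{
    return raw % total_number_of_gene_values;
}

/* Number of distinct 32-bit integers that decode to any one gene value,
   computed as 2^32 / total with integer division (exact when the total
   divides 2^32, otherwise rounded down). */
static uint64_t representations_per_value(uint32_t total_number_of_gene_values)
{
    return (UINT64_C(1) << 32) / total_number_of_gene_values;
}
```

For the eye-color example, decode_gene(raw, 4) always yields 0, 1, 2 or 3, and representations_per_value(4) is 1,073,741,824 (2^30).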
A chromosome can be defined with a structure. For example, an individual that has four
genes would be characterized with the following structure:
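#include <stdint.h>

/* One 32-bit integer per gene; unsigned so the modulo decode is never negative. */
struct individual
{
    uint32_t geneA;
    uint32_t geneB;
    uint32_t geneC;
    uint32_t geneD;
};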
Although the new representation technique consumes more memory than a technique that
allows only one number per gene value and packs the genes together in an attempt to
minimize memory space requirements, computers today have much larger memories than
even just a few years ago. Using more memory to define a chromosome isn’t an issue today.
The initial population is created in the same fashion as typical GAs – using a random number
generator to create the values for each gene within an individual’s chromosome. The only
difference is that the random number for this representation must be 32 bits in length.
The key to the speed of a GA using this representation technique is the variability of the
offspring caused by the crossover operator. Whether the crossover operator used is single or
double, as long as the crossover operation can slice through any location within the gene, this
representation will create children that have a high probability of varying from their parents.
Note that this representation technique has no advantages when uniform crossover is used.
To illustrate how variability can be achieved in offspring, the following two individuals are
composed of four genes with a total of thirteen value options per gene. Using the modulo
operation on each gene, both individuals have gene values of 6, 1, 11, 11.
Using single point crossover to slice through the middle of gene two of each chromosome
and then recombining the parents to create two children, the following new individuals will
be created.
Child A now has the gene values of 6, 8, 11, 11 and Child B is 6, 7, 11, 11. In a more
traditional representation technique, slicing through a gene from parents having the same
value in that gene would produce no variation for that gene value in the children.
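The effect of slicing through a gene can be sketched in C as follows. The function name cross_gene and the parent values used in the note after the code are hypothetical; they are not the individuals from the paper's own example, and the sketch assumes the simple modulo decoding of an unsigned 32-bit gene.

```c
#include <stdint.h>

/* Build one child gene by single-point crossover inside a 32-bit gene.
   The child keeps the high-order bits of `high_parent` and takes the
   `point` low-order bits (1..31) from `low_parent`. Calling the function
   a second time with the parents swapped gives the other child. */
static uint32_t cross_gene(uint32_t high_parent, uint32_t low_parent, unsigned point)
{
    uint32_t low_mask = (UINT32_C(1) << point) - 1u;
    return (high_parent & ~low_mask) | (low_parent & low_mask);
}
```

With 13 value options per gene, the hypothetical parent genes 14 and 65547 both decode to 1. Crossing them at bit 16 gives the integers 11 and 65550, which decode to 11 and 4, so both children differ from parents that shared the same gene value.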
By creating variability in the children with the crossover operator, this representation does not
require a mutation operator. Many GAs need mutation to move the population off of a
sub-optimal solution; with this representation, that movement occurs naturally through
crossover.
Although the two individuals seem very different, if the modulo 13 of each gene is taken,
they both decode to the following values. This means both individuals will have exactly the
same fitness even though they appear to be dramatically different.
6, 1, 11, 11
In summary: Cloning is a way to create greater diversity in the generations that follow from
the crossover operations that are applied to the parent chromosomes. When a gene is sliced,
the chance of the recombined genes from two parents creating the same modulo value in the
offspring gene is far lower than in traditional representation techniques. In a traditional
encoding scheme, two parent genes that have the same value will always produce offspring
that have this same value in this gene no matter where a crossover operation might happen
within the gene. By cloning and replacing individuals in the population with their clones,
variability can be generated in succeeding generations and the search space more easily
covered with fewer crossover operations.
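One way a clone of a single gene might be generated is sketched below in C. The paper does not say how the alternative representation is chosen, so the approach here is an assumption: pick a random multiple of the number of gene values and add the decoded value. The names clone_gene, next_random and rng_state are hypothetical, and the small xorshift generator only keeps the sketch self-contained.

```c
#include <stdint.h>

/* Minimal xorshift32 generator; the state must be nonzero. */
static uint32_t rng_state = 2463534242u;

static uint32_t next_random(void)
{
    uint32_t x = rng_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    rng_state = x;
    return x;
}

/* Return a 32-bit integer that decodes (modulo `total`) to the same gene
   value as `gene` but usually has a very different bit pattern.
   Assumes total >= 2. The slight modulo bias in choosing k is ignored. */
static uint32_t clone_gene(uint32_t gene, uint32_t total)
{
    uint32_t value = gene % total;
    /* Largest multiplier k for which k * total + value fits in 32 bits. */
    uint32_t max_k = (UINT32_MAX - value) / total;
    uint32_t k = next_random() % (max_k + 1u);
    return k * total + value;
}
```

Applying clone_gene to every gene of an individual would yield a clone with identical decoded values, and therefore identical fitness, as the summary above describes.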
One major problem arises from using the new phenotype representation technique: genes
within the genotype that have an even number of possible values don’t converge towards a
solution. This paper calls this the “even-valued gene” problem, and it has a
major impact on how the algorithm performs.
Two techniques are proposed to solve this problem. The first uses the gene as an encoded
number requiring decoding before the actual gene value is obtained. The second technique
modifies genes that have an even number of values with one additional value that is illegal.
This solves the problem but also creates offspring that have possible illegal gene values that
need to be dealt with.
To illustrate the problem, the following graph shows the results of the new representation
technique trying to solve the sample problem defined in the section “Target Problem for GA
Testing and Its Common Fitness Function” below. The number of iterations through the GA
loop was limited to 1,000. Without the limit placed on the number of iterations, the loop in
many cases would go on forever.
[Figure: Iterations to Reach Target Hypothesis versus Problem Size.]
The first idea proposed to solve this problem was encoding the gene value. This technique
performs a very simple operation on a gene’s representational integer to find the final gene value:
the integer is rotated by the number held in its bits D0 through D2 plus eight.
By using this idea, genes that have an even number of values can still be represented by
integers that are of odd and even types. When these integers are combined through the
crossover operator, the result will create children that have a reasonable amount of diversity.
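A possible reading of this encoding is sketched below in C. Two details are assumptions, since the text does not state them: the rotation is taken to be a right rotation, and the modulo is taken to be applied to the rotated integer. The names rotr32 and decode_rotated_gene are hypothetical.

```c
#include <stdint.h>

/* Rotate a 32-bit integer right by n bits (n is reduced modulo 32). */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x >> n) | (x << (32u - n)) : x;
}

/* Decode a gene by first rotating it by (bits D0..D2) + 8, which gives a
   rotation amount between 8 and 15, and then taking the modulo. */
static uint32_t decode_rotated_gene(uint32_t raw, uint32_t total)
{
    unsigned amount = (unsigned)(raw & 7u) + 8u;
    return rotr32(raw, amount) % total;
}
```

Because the rotation amount depends on the low three bits, the bits that end up in the low-order positions after rotation come from different parts of the integer for different genes.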
The following graph shows the results of applying this simple solution to the sample problem
and the resulting improvement in the algorithm’s performance.
[Figure: Iterations to Reach Target Hypothesis versus Problem Size.]
Another option is to disallow even-valued genes in a chromosome. Although this idea runs
against the algorithm’s original intent of never creating illegal values in child
chromosomes that need to be dealt with, it does offer another solution to the problem. The
implementation of this technique is to simply add one additional illegal value to every gene that
has an even total number of value options. When crossover is performed,
children that are created with illegal-valued genes are simply discarded. The following
graph shows the results of modifying the GA with this scheme to solve the problem.
[Figure: Iterations to Reach Target Hypothesis versus Problem Size.]
[Figure: % of Iterations that Result in Production of Illegal Children versus Problem Size.]
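The second technique, adding one illegal value to each even-valued gene, can be sketched in C as follows. The names padded_modulus, decode_padded_gene and ILLEGAL_GENE are hypothetical, and treating the extra (highest) remainder as the illegal value is an assumption made for illustration.

```c
#include <stdint.h>

/* Sentinel returned when a gene decodes to the added illegal value. */
#define ILLEGAL_GENE UINT32_MAX

/* An even number of value options is padded to the next odd number. */
static uint32_t padded_modulus(uint32_t total)
{
    return (total % 2u == 0u) ? total + 1u : total;
}

/* Decode with the padded modulus; the one extra remainder is illegal. */
static uint32_t decode_padded_gene(uint32_t raw, uint32_t total)
{
    uint32_t value = raw % padded_modulus(total);
    return (value < total) ? value : ILLEGAL_GENE;
}
```

A child for which any gene decodes to ILLEGAL_GENE would be discarded, as described above.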
Both techniques to solve this problem were tested, and the results are presented in the “Test
Results” section later in this paper.
Hypothesis Representation
Each individual will be represented by a structure with one 32-bit integer assigned to each
trait within the allele. For example, an individual that is represented by four traits would be
defined by a structure with four 32-bit integer members, one per trait.
During the test phase of this project, the number of traits will be varied to determine two
things. First by varying the number of traits, the number of hypotheses being searched is
varied as well. The search performance will be compared to the size of the search space to
see how the GA scales to the problem size. The number of traits will also be varied to
determine if the algorithm has any biases towards the number of values in a trait.
Population
The population is the group of potential parent hypotheses that are available for the selection
process. The initial population will be randomly generated. During testing of the GA, the
population size will be varied to see how population size impacts the performance of the GA.
Selection Technique
Tournament selection will be used for all GA testing. The Tour size will be varied to
determine GA performance as a function of Tour size. The selection process will always only
select two parents for breeding.
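Tournament selection as used here can be sketched in C. This is an illustrative sketch under stated assumptions: lower fitness is treated as better (the goal has fitness zero), tournament members are drawn with replacement, and the names tournament_select, next_random and rng_state are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal xorshift32 generator; the state must be nonzero. */
static uint32_t rng_state = 2463534242u;

static uint32_t next_random(void)
{
    uint32_t x = rng_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    rng_state = x;
    return x;
}

/* Draw tour_size individuals at random (with replacement) and return the
   index of the fittest one. Lower fitness is better. Assumes pop_size >= 1
   and tour_size >= 1. */
static size_t tournament_select(const unsigned *fitness, size_t pop_size,
                                size_t tour_size)
{
    size_t best = next_random() % pop_size;
    for (size_t i = 1; i < tour_size; ++i) {
        size_t challenger = next_random() % pop_size;
        if (fitness[challenger] < fitness[best])
            best = challenger;
    }
    return best;
}
```

Calling tournament_select twice gives the two parents for breeding. With a population of 50 and a Tour percentage of 10%, tour_size would be 5.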
Crossover Operator
In all cases single-point crossover will be used to create two offspring from the selected
parents. A maximum of one offspring can survive. The most-fit offspring is determined and it
can replace the worst individual in the population if its fitness is better than that individual’s
fitness.
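The replacement rule can be sketched in C as a pair of small helpers. The names worst_index and child_replaces_worst are hypothetical, and lower fitness is assumed to be better, consistent with a goal fitness of zero.

```c
#include <stddef.h>

/* Index of the worst individual, i.e. the one with the largest fitness
   value (lower fitness is better). Assumes pop_size >= 1. */
static size_t worst_index(const unsigned *fitness, size_t pop_size)
{
    size_t worst = 0;
    for (size_t i = 1; i < pop_size; ++i) {
        if (fitness[i] > fitness[worst])
            worst = i;
    }
    return worst;
}

/* Returns 1 when the fitter of the two children is strictly better than
   the worst individual in the population, and 0 otherwise. At most one
   child can therefore enter the population. */
static int child_replaces_worst(const unsigned *fitness, size_t pop_size,
                                unsigned child_a_fitness,
                                unsigned child_b_fitness)
{
    unsigned best_child = (child_a_fitness < child_b_fitness)
                              ? child_a_fitness : child_b_fitness;
    return best_child < fitness[worst_index(fitness, pop_size)];
}
```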
Cloning Operator
The cloning operator will replace the mutation operator that is usually a standard feature of a
GA. The cloning operator will be applied on the basis of algorithm stagnation. The stagnation
is measured as a function of offspring not replacing the worst parent in the population. The
number of times this replacement does not occur is counted and at some defined level, the
cloning operator is applied. The level at which the cloning operator is applied will be
varied during the test phase to see how GA performance is affected by this variable.
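The stagnation trigger can be sketched in C as follows. The text says only that misses are counted and that cloning is applied at a defined level, so two details here are assumptions: the count is of consecutive misses, and it resets to zero whenever a child does replace the worst individual. The names update_misses and should_clone are hypothetical.

```c
/* Update the miss count after one pass through the crossover operator.
   Assumption: a successful replacement resets the count to zero. */
static unsigned update_misses(unsigned misses, int child_replaced_worst)
{
    return child_replaced_worst ? 0u : misses + 1u;
}

/* Cloning is applied once the miss count reaches the configured level. */
static int should_clone(unsigned misses, unsigned cloning_level)
{
    return misses >= cloning_level;
}
```

With a cloning level of 1, should_clone is true after every pass that fails to replace the worst individual.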
The fitness function will measure the distance a hypothesis is from the goal number by
comparing each digit in the hypothesis with the corresponding digit in the goal number. The
absolute value of the difference for each digit is added together to produce the fitness of any
individual. The goal number is reached when the fitness function returns a fitness value of
zero.
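The fitness calculation can be sketched in C. This sketch assumes that each gene encodes one digit of the hypothesis and that digits are decoded with the simple modulo scheme; the function name fitness and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of the absolute digit-by-digit differences between the decoded
   hypothesis and the goal number. A result of zero means the goal has
   been reached. `base` is the number of values each digit can take. */
static unsigned fitness(const uint32_t *genes, const uint32_t *goal_digits,
                        size_t num_digits, uint32_t base)
{
    unsigned total = 0;
    for (size_t i = 0; i < num_digits; ++i) {
        uint32_t digit = genes[i] % base;
        uint32_t goal = goal_digits[i];
        total += (digit > goal) ? digit - goal : goal - digit;
    }
    return total;
}
```

For a base 13 goal of 6, 1, 11, 11, a hypothesis that decodes to 6, 8, 11, 11 has a fitness of 7.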
Test Results
The performance of the GA was measured by running a number of test runs and then
averaging the results. As was pointed out in [9] “performance measurements are generally
taken as average over sets of runs”, so averaging GA results is key to producing accurate
data. In all test cases the GA was run on a problem for a total of 500 passes. The GA was
also limited to 1,000 iterations on a search before the search was aborted. Failures to find
the optimum hypothesis were not recorded separately in the test results; the number 1,000 was
used as the search iteration number on all failures.
All results will show two graphs: one for each solution to the even-valued gene problem. In
most cases the two solutions perform almost identically on the problem, but a lot of variation
can be seen when the problem size is varied, because several of those searches expose the
even-valued gene problem in the algorithm.
The test results also include a test run on the GA to evaluate its robustness in dealing with
small populations. Although this test has little to do with GA performance, it does
demonstrate the GA’s ability to move towards a solution even when the GA is configured in
a way that would stop other GAs and leave them stuck on a sub-optimal result.
This test varies the population size while keeping the problem size, cloning rate and Tour
percentage fixed. The Tour size does vary as a function of population size because its size is
fixed at 10% of the population size. The cloning rate was 100%. This means the result of the
crossover operation is cloned if the child created by the crossover operation doesn’t improve
upon the fitness of the parents. The problem size was fixed as a four-digit base 33 number.
The selection of a base 33 number means the even-valued gene problem will not be apparent
in the resulting data.
As can be seen from the graphs below, population size doesn’t have a lot of impact on GA
performance. In fact, smaller initial population sizes do slightly better than larger ones. One
possible reason for this behavior is that the GA is more focused when the population size is
small and able to create faster forward progress with fewer individuals.
[Figures: Iterations to Reach Target Hypothesis versus Population Size (10 to 100), one graph for each solution to the even-valued gene problem.]
This test varies the problem size while keeping the population size, Tour percentage and
cloning rate constant. The population was fixed at 50 individuals, the Tour percentage was
set at 10% of the population size (in this case 5), and the Cloning rate was set at 100%. The
problem size was controlled by generating a target four-digit number whose base was varied
from base 4 to base 32. Creating target numbers with even bases will create the even-valued
gene problem and the difference in the two techniques to resolve this problem is apparent in
the graphed data below.
[Figures: Iterations to Reach Target Hypothesis versus Problem Size, one graph for each solution to the even-valued gene problem.]
This test varies the Tour size by changing the Tour percentage while keeping the population
size, problem size and cloning rate fixed. The population size was set at 50, the Cloning rate
was 100% and the problem size was fixed at a four-digit base 33 number.
One surprising result seen in the graphs below is that the performance degrades as the Tour
percentage is increased to 100% of the population size. At Tour percentages from 10% to
40% of the population size, the GA performance is roughly the same. Above 40%, however,
the number of iterations through the crossover operator required to find the target hypothesis increases
rapidly. At the 100% Tour percentage level, the search time is increased by roughly 30%
over the search time for a 10% Tour percentage.
An explanation for this is that higher Tour sizes have a tendency to select the same
individuals from the population over and over again. At smaller Tour sizes, it is more likely
more individuals from the population will be selected and this will create more diversity in
the offspring, moving the GA forward more quickly to a solution.
[Figures: Iterations to Reach Target Hypothesis versus Tour Size (10 to 100), one graph for each solution to the even-valued gene problem.]
This test varies the cloning rate while keeping the population size, Tour size and problem
size constant. For this test the population size was fixed at 50, the Tour percentage was set at
10% and the problem size was a four-digit base 33 number.
The cloning rate isn’t a probability for applying the Cloning operator but instead is a number
representing the number of iterations through the crossover operator without any children
replacing individuals in the population. For this test the rate was varied from 1 to 100 misses
before the operator was applied.
One thing is obvious from the graphs of the test results below: the Cloning operator has a
major impact on the GA’s performance. Cloning the parents at a 100% rate when a crossover
operation doesn’t create children that have fitness better than their parents improves GA
performance roughly three-fold over cloning at the lowest rate tested (100 misses before Cloning is
performed).
[Figures: Iterations to Reach Target Hypothesis versus Cloning Rate (1 to 100), one graph for each solution to the even-valued gene problem.]
An early test done to validate the operation of the new representation technique was to use a
population size of 2 and verify the GA could still find the target hypothesis. In tests done on
GAs using a simple binary representation for the phenotype, small populations almost
guaranteed the GA would get stuck on a sub-optimal solution. This also occurred very early
in the GA’s processing of the problem and could only be resolved by the mutation operator.
The ability of the new representation to deal with this problem shows how it can still quickly
navigate through the search space even when it has few candidate hypotheses.
The following graphs show the performance of the GA using the new representation
technique with a population size fixed at 2. The problem size was varied from a base 4 four-
digit number to a base 32 four-digit number. The Tour percentage doesn’t apply in this test.
The 2 individuals in the population are always the parents of the next generation.
[Figures: Iterations to Reach Target Hypothesis versus Problem Size (4 to 32) for a population size of 2, one graph for each solution to the even-valued gene problem.]
Conclusions
Although the representation technique is a simple one, it creates a great deal of diversity in
the offspring produced by the crossover operator and helps the GA quickly navigate through
the search space. The addition of the Cloning operator improves the performance of the GA
by as much as a factor of 3 and is a key component in the proposed GA enhancement
scheme.
Testing shows that the number of iterations through the crossover operator stays within
O(n ln n) as the input size grows. This is very similar to BGAs [6].
Additional work could look into the application of this representation on other problems
including TSP. The problem chosen for this paper was searching for a target number whose
search space was easily controlled by varying the base of the target number. Although this
problem was convenient for testing the proposed GA, it isn’t a standard for GA performance
testing and other problems should be attempted.
[2] David E. Goldberg and Kalyanmoy Deb and Hillol Kargupta and Georges Harik (1993). Rapid
Accurate Optimization of Difficult Problems Using Fast Messy Genetic Algorithms. Proc. of the Fifth
Int. Conf. on Genetic Algorithms, Morgan Kaufmann, San Mateo, CA. Pages 56-64
[3] James R. Levenick (1991). Inserting introns improves genetic algorithm success rate: Taking a
cue from biology. Proceedings of the Fourth International Conference on Genetic Algorithms:
Morgan Kaufmann, San Mateo, CA. Pages 123-127
[4] Nicholas J. Radcliffe (1992). Non-Linear Genetic Representations. Parallel problem solving from
nature 2, Publisher North-Holland, Amsterdam. Pages 259-268
[5] Connie Loggia Ramsey and Kenneth A. De Jong and John J. Grefenstette and Annie S. Wu and
Donald S. Burke (1998). Genome Length as an Evolutionary Self-adaptation. Lecture Notes in
Computer Science volume 1498 pages 345-???.
[6] Salomon, R. (1996). The influence of different coding schemes on the computational complexity
of genetic algorithms in function optimization. In H.-M. Voigt, W. Ebeling, I. Rechenberg, and H.-P.
Schwefel (Eds.), Proceedings of the Fourth International Conference on Parallel Problem Solving from
Nature, Berlin, pp. 227–235. Springer-Verlag.
[7] Salomon, R. (1997). Improving the Performance of Genetic Algorithms through Derandomization.
Software - Concepts and Tools, volume 18, number 4 pages 175-184
[8] Wagner, G. P. (1995). Adaptation and the modular design of organisms. In Moran, F., Moreno,
A., Merelo, J.J., and Chacon, P. eds. Lecture notes in artificial intelligence: advances in artificial life,
317-328. Berlin-Heidelberg: Springer-Verlag.
[9] Annie S. Wu and Robert K. Lindsay and Michael D. Smith (1994). Studies on the effect of
non-coding segments on the genetic algorithm. Proceedings of the 6th IEEE International Conference on
Tools with Artificial Intelligence, New Orleans, LA
[10] Annie S. Wu and Robert K. Lindsay and Rick Riolo (1997). Empirical Observations on the
Roles of Crossover and Mutation. Proc. of the Seventh Int. Conf. on Genetic Algorithms: Morgan
Kaufmann, San Francisco, CA, pages 362-369
[11] A. Wu and I. Garibay (2002). The proportional genetic algorithm: Gene expression in a genetic
algorithm. Genetic Programming and Evolvable Hardware, vol. 3, no. 2
[12] Yu, Tina and Bentley, Peter (1998). Methods to Evolve Legal Phenotypes. Agoston E. Eiben and
Thomas Back and Marc Schoenauer and Hans-Paul Schwefel eds. Fifth International Conference on
Parallel Problem Solving from Nature: Springer-Verlag volume 1498, pages 280-291.
ISBN 3-540-65078-4