Evolutionary Algorithms, Fitness Landscapes and Search
by
Terry Jones
DISSERTATION
For me.
Acknowledgments
I have somehow managed to spend six years, in three universities, working to-
wards a Ph.D. After a false start at the University of Waterloo, I enrolled at Indiana
University and subsequently transferred to the University of New Mexico after leaving
to do research at the Santa Fe Institute. In the course of these wanderings, I have
been fortunate to interact with many people who have influenced me greatly. One of the pleasures of finally finishing is this opportunity to thank them.
At Waterloo, Charlie Colbourn was the first professor who regarded me as a
friend rather than as a chore. Charlie inspired me to work on algorithms in a rigorous
way, taught me what a proof is, and treated me as though what I thought was
important. He apparently effortlessly supervised ten graduate students and produced
many papers while always having time to talk, eat lunch, share a joke, and even help
students move house.
Soon after arriving at Waterloo, I was adopted by Gregory Rawlins, who unex-
pectedly invited me to live at his apartment, after noticing that I was living in the
Math building, using the suspended ceiling as a wardrobe. Gregory introduced me to
posets and optimality in comparison-based algorithms. More than anyone, he has influenced the research directions I have taken over the years. He subsequently became
my supervisor at Indiana and remained on my dissertation committee when I trans-
ferred to UNM. Gregory taught me to ask interesting questions, showed me what
quality was, and encouraged my occasionally odd approaches to problem solving.
I was also greatly influenced by Ian Munro, who supervised my master's work and my first attempt at a Ph.D. Ian's ability to rapidly assess the worth of ideas and
algorithms is amazing. I would spend a week or two on an approach to a problem
and Ian could understand it, deconstruct it and tell me (correctly) that it wouldn't
work in about a minute. On one memorable occasion, I went into his office with an
algorithm I had devised and worked on furiously for at least two weeks. Ian listened
for his customary minute, leaned back in his chair, and said "Do you want me to tell you the smallest n for which it will not be optimal?"
Andrew Hensel, with whom I shared so much of my two and a half years at
Waterloo, was the most original and creative person I have ever known well. Together,
we dismantled the world and rebuilt it on our own crazy terms. We lived life at a
million miles an hour and there was nothing like it. Five years ago, Andrew killed
himself. There have been few days since then that I have not thought of him and the
time we spent together.
ideal foil to my recreational tendencies and limited attention span. It was Steph who
insisted that I look for a connection between my landscapes and AI search techniques.
I have come to respect and trust her judgement and to think of her as a good friend,
collaborator and confidant.
Of the universities I have attended, UNM's computer science department is by
far the smallest and least well funded. However, it has been at UNM that I have finally found myself surrounded by other graduate students with whom I was inclined to interact academically. The Friday meetings of the "adaptive" group have
been the highlight of my UNM time and I thank all the members of the group for
keeping life interesting. In particular, I'd like to thank my evil twins, Derek Smith
and Ron Hightower. Together we have spent many enjoyable hours rollerblading at
high speed on campus and talking about our research. Derek in particular has always
been very quick to understand my ideas about fitness landscapes. He has helped
me to see where I was wrong and why I was occasionally right. He read this entire
work and greatly helped with its content, style, and appearance. This dissertation
would have been possible without Derek, but far less likely. Derek and Ron both
have broad knowledge of computer science and possess high-speed, accurate, research
quality detectors. They are full of interesting ideas and suggestions, and are models
of academic generosity.
It is not possible to summarize Ana Mosterín Hopping and her influence on me in
one paragraph, but I will try. Ana has completely changed my life over the last three
years. Her love, intelligence, honesty, goodness, healthiness, humor, taste, liveliness
and beauty have given me something to live for. She is the most balanced and well-
adjusted person I have ever known. The fact of her existence is a continual miracle
to me. She has supported me in hundreds of ways throughout the development and
writing of this dissertation.
My parents are also very special people. My mother once told me that the last
thing she and I had in common was an umbilical cord. Despite this, we are very close
and I love them dearly. They have given their unconditional support, knowing that
doing so contributed greatly to my absence these last nine years. They were strong
enough to let me go easily, to believe in me, and to let slip away all those years during
which we could have been geographically closer and undoubtedly driving each other
crazy.
The other members of my dissertation committee, Ed Angel, Paul Helman,
George Luger and Carla Wofsy were very helpful. They were all interested in what
I was doing and improved the quality of what I eventually wrote. Ed's very healthy,
even robust, cynicism and his willingness to give me a hard time at every opportunity,
are the kind of attitude that I appreciate and enjoy the most. Paul made very detailed
comments on all aspects of the dissertation, and tried to keep me honest with respect
to connections to dynamic programming and branch and bound. George refused to
let me get away with murder in the research hours I took with him, and made sure
that I did a good job of understanding AI search algorithms. Though Carla was a
late addition to the committee, she quickly and happily read the dissertation.
Joe Culberson, of the University of Alberta, is one of the few people who appre-
ciated and fully understood what I was trying to achieve in this dissertation. Joe and
I share similar views on many aspects of fitness landscapes. I had known Joe for years
before I was pleasantly surprised to discover that we had independently conceived of
a crossover landscape in essentially the same way. We have exchanged many ideas
about landscapes for the last two years, keeping careful track of each other's sanity.
My thinking and writing are clearer as a result of Joe's rigor and insistence that I
make myself clear to him.
At SFI I have profited from many discussions with Melanie Mitchell, Chris Langton, Richard Palmer, Una-May O'Reilly, Rajarshi Das, Michael Cohen, Aviv Bergman, Peter Stadler, Walter Fontana, Stu Kauffman, and Wim Hordijk (who also read and commented on the first two chapters of this dissertation). I have made friends and received encouragement from many people in the research community, including David Ackley, Ken De Jong, David Fogel, Jeff Horn, Rich Korf, Nils Nilsson, Nick Radcliffe, Rick Riolo, Gunter Wagner and Darrell Whitley. Jesús Mosterín read my thesis proposal, cheerfully scribbled "nonsense!" across it, and then sent me a useful page of formal definitions. Many members of the staff at SFI, especially Ginger Richardson, have become good friends. The CS department at UNM has been wonderful; I have been particularly helped by Ed Angel, Joann Buehler, Jim Herbeck and Jim Hollan.
Finally, I have made many friends along the way. They have helped me, one
way or another, in my struggle to complete a Ph.D. Many thanks to Sandy Amass,
Marco Ariano, Marcella Austin, Amy Barley, Greg Basford, Dexter Bradshaw, Ted
Bucklin, Peter Buhr, Susy Deck, Emily Dickinson, Elizabeth Dunn, Beth Filliman,
Bob French, Julie Frieder, Elizabeth Gonzalez, Fritz Grobe, Steve Hayman, Ursula
Hopping, Helga Keller, The King, Jim Marshall, Gary McGraw, Heather Meek, Eric
Neufeld, Luke O'Connor, Lisa Ragatz, Steven Ragatz, Kate Ryan, Philip San Miguel,
John Sellens, Francesca Shrady, Lisa Thomas, Andre Trudel and the one and only
Francoise Van Gastel.
ABSTRACT
A new model of fitness landscapes suitable for the consideration of evolutionary and other search algorithms is developed and its consequences are investigated. Answers to the questions "What is a landscape?" "Are landscapes useful?" and "What makes a landscape difficult to search?" are provided. The model makes it possible to construct landscapes for algorithms that employ multiple operators, including operators that act on or produce multiple individuals. It also incorporates operator transition probabilities. The consequences of adopting the model include a "one operator, one landscape" view of algorithms that search with multiple operators.

An investigation into crossover landscapes and hillclimbing algorithms on them illustrates the dual role played by crossover in genetic algorithms. This leads to the "headless chicken" test for the usefulness of crossover to a given genetic algorithm and to serious questions about the usefulness of maintaining a population. A "reverse hillclimbing" algorithm is presented that allows the determination of details of the basin of attraction of points on a landscape. These details can be used to directly compare members of a class of hillclimbing algorithms and to accurately predict how long a particular hillclimber will take to discover a given point.

A connection between evolutionary algorithms and the heuristic search algorithms of Artificial Intelligence and Operations Research is established. One aspect of this correspondence is investigated in detail: the relationship between fitness functions and heuristic functions. By considering how closely fitness functions approximate the ideal for heuristic functions, a measure of search difficulty is obtained. This measure, fitness distance correlation, is a remarkably reliable indicator of problem difficulty for a genetic algorithm on many problems taken from the genetic algorithms literature, even though the measure incorporates no knowledge of the operation of a genetic algorithm. This leads to one answer to the question "What makes a problem hard (or easy) for a genetic algorithm?" The answer is perfectly in keeping with what has been well known in Artificial Intelligence for over thirty years.
TABLE OF CONTENTS

Chapter  Page

1  INTRODUCTION  1
   1.1.  Evolutionary Algorithms  4
   1.2.  Fitness Landscapes  5
   1.3.  Model of Computation  5
   1.4.  Simplicity  9
   1.5.  Related Work  11
   1.6.  Dissertation Outline  11
   1.7.  Abbreviations  12
2  A MODEL OF LANDSCAPES  13
   2.1.  Motivation  13
   2.2.  Preliminary Definitions  17
          2.2.1.  Multisets  17
          2.2.2.  Relations, Orders and Digraphs  18
   2.3.  Search Problems and Search Algorithms  19
   2.4.  Landscapes  25
   2.5.  Definitions and Special Cases  26
          2.5.1.  φ-neighborhood and φ-neighbor  27
          2.5.2.  φ-maximum or φ-peak  27
          2.5.3.  global-maximum  27
          2.5.4.  φ-local-maximum  27
          2.5.5.  φ-plateau  28
          2.5.6.  φ-mesa  28
          2.5.7.  φ-saddle-region  28
          2.5.8.  φ-basin-of-attraction  29
          2.5.9.  Fixed Cardinality Operators  29
          2.5.10. Walkable Operators  29
          2.5.11. Symmetric Operators  30
          2.5.12. φ-connected-components  30
          2.5.13. Natural Landscapes  30
   2.6.  Four Operator Classes and Their Landscapes  30
          2.6.1.  Single-change Operators  31
          2.6.2.  Mutational Operators  32
          2.6.3.  Crossover Operators  34
          2.6.4.  Selection Operators  34
   2.7.  Consequences of the Model  36
   2.8.  Implications for Genetic Algorithms  37
   2.9.  Advantages of the Model  39
   2.10. Limitations of the Model  40
   2.11. Operators and Representation  41
   2.12. Objects and Representation  41
   2.13. Origins and Choices  42
   2.14. The Usefulness of the Landscape Metaphor  45
   2.15. Conclusion  46
3  CROSSOVER, MACROMUTATION, AND POPULATION-BASED SEARCH  47
   3.1.  Introduction  47
   3.2.  Search as Navigation and Structure  48
   3.3.  Crossover Landscapes  49
   3.4.  Experiment Overview  50
   3.5.  The Simple Genetic Algorithm  54
   3.6.  The Bit-flipping Hillclimbing Algorithm  55
   3.7.  The Crossover Hillclimbing Algorithm  57
   3.8.  Test Problems  58
          3.8.1.  One Max  58
          3.8.2.  Fully Easy  59
          3.8.3.  Fully Deceptive  60
          3.8.4.  Distributed Fully Deceptive  60
          3.8.5.  Busy Beavers  60
                   3.8.5.1.  Fitness and Representation  61
                   3.8.5.2.  Mesas  62
          3.8.6.  Holland's Royal Road  62
                   3.8.6.1.  Description  63
                   3.8.6.2.  Experiments  65
   3.9.  Results  65
          3.9.1.  Overall Trends  65
          3.9.2.  One Max  67
          3.9.3.  Fully Easy  68
          3.9.4.  Fully Deceptive  69
          3.9.5.  Distributed Fully Deceptive  70
          3.9.6.  Busy Beavers  70
          3.9.7.  Holland's Royal Road  71
   3.10. Why Does CH Perform so Well?  72
   3.11. Crossover: The Idea and The Mechanics  76
   3.12. The Headless Chicken Test  77
   3.13. Macromutational Hillclimbing  81
   3.14. Summary  82
4  REVERSE HILLCLIMBING  85
   4.1.  Introduction  85
   4.2.  Local Search Algorithms  86
          4.2.1.  Iterated Local Search Algorithms  87
   4.3.  Hillclimbing Algorithms  88
          4.3.1.  Any Ascent Hillclimbing  88
          4.3.2.  Least Ascent Hillclimbing  91
          4.3.3.  Median Ascent Hillclimbing  91
          4.3.4.  Steepest Ascent Hillclimbing  91
   4.4.  The Reverse Hillclimbing Algorithm  91
          4.4.1.  The Basic Algorithm  91
          4.4.2.  Augmenting the Basic Algorithm  93
   4.5.  An Important Tradeoff  95
   4.6.  When can a Hillclimber be Reversed?  96
   4.7.  Test Problems  97
          4.7.1.  NK Landscapes  97
          4.7.2.  Busy Beavers  98
   4.8.  Experiment Overview  98
   4.9.  Results  100
          4.9.1.  NK Landscapes  100
          4.9.2.  Busy Beavers  108
   4.10. Conclusion  113
5  EVOLUTIONARY ALGORITHMS AND HEURISTIC SEARCH  121
   5.1.  Introduction  121
   5.2.  Search in Artificial Intelligence  122
   5.3.  Similarities Between Evolutionary Algorithms and Heuristic Search  124
          5.3.1.  Landscapes and State Spaces  124
          5.3.2.  Individuals, Potential Solutions, and Populations  127
          5.3.3.  Fitness Functions and Heuristic Functions  127
          5.3.4.  Navigation Strategies and Control Strategies  128
   5.4.  Differences Between Evolutionary Algorithms and State Space Search  130
   5.5.  Fitness and Heuristic Functions Revisited  131
   5.6.  GA Difficulty  132
   5.7.  Fitness Distance Correlation  134
          5.7.1.  Summary of Results  135
                   5.7.1.1.  Confirmation of Known Results  137
                   5.7.1.2.  Confirmation of Unexpected Results  150
                   5.7.1.3.  Confirmation of Knowledge Regarding Coding  152
          5.7.2.  Discussion  158
          5.7.3.  Conclusion  161
6  RELATED WORK  162
   6.1.  Introduction  162
   6.2.  Related Work on Landscapes  162
          6.2.1.  Landscapes in Biology  162
          6.2.2.  Landscapes in Physics and Chemistry  164
          6.2.3.  Landscapes in Computer Science  165
   6.3.  Work Related to the Heuristic Search Connection  167
   6.4.  Work Related to FDC  168
7  CONCLUSIONS  172
   7.1  What is a landscape?  172
   7.2  What Makes a Search Algorithm?  173
   7.3  Can the Pieces be Reassembled?  174
   7.4  Is Crossover Useful?  175
   7.5  Can the Usefulness of Crossover be Tested?  176
   7.6  What is a Basin Of Attraction?  177
   7.7  What Makes Search Hard?  178
   7.8  Are Evolutionary and Other Search Algorithms Related?  179
APPENDIX A  181
APPENDIX B  185
APPENDIX C  191
APPENDIX D  201
REFERENCES  208
LIST OF FIGURES

Figure  Page

17  CH, GA-S, GA-E and BH on 10 fully deceptive problems.  70
18  CH, GA-S, GA-E and BH on 15 fully deceptive problems.  70
19  CH, GA-S, GA-E and BH on 10 distributed fully deceptive problems.  71
20  CH, GA-S, GA-E and BH on 15 distributed fully deceptive problems.  71
21  CH, GA-S, GA-E and BH on the 3-state Turing machines.  72
22  CH, GA-S, GA-E and BH on the 4-state busy beaver problem.  72
23  CH, GA-S, GA-E and BH on Holland's royal road with k = 4.  73
24  CH, GA-S, GA-E and BH on Holland's royal road with k = 6.  73
25  CH, CH-NJ and CH-1S on a 120-bit one max problem.  75
26  CH, CH-NJ and CH-1S on a 90-bit fully easy problem.  75
27  CH, CH-NJ and CH-1S on a 90-bit fully deceptive problem.  75
28  CH, CH-NJ and CH-1S on a 90-bit distributed fully deceptive problem.  75
29  CH, CH-NJ and CH-1S on the 4-state busy beaver problem.  76
30  CH, CH-NJ and CH-1S on Holland's royal road with k = 4.  76
31  The random crossover operator.  78
32  GA-S and GA-RC on a 120-bit one max problem.  79
33  GA-S and GA-RC on a 90-bit fully easy problem.  79
34  GA-S and GA-RC on a 90-bit fully deceptive problem.  79
35  GA-S and GA-RC on a 90-bit distributed fully deceptive problem.  79
36  GA-S and GA-RC on the 4-state busy beaver problem.  80
37  GA-S and GA-RC on Holland's royal road with k = 4.  80
38  CH-1S, GA-E, BH-MM and BH-DMM on a 120-bit one max problem.  82
39  CH-1S, GA-E, BH-MM and BH-DMM on a 90-bit fully easy problem.  82
40  CH-1S, GA-E, BH-MM and BH-DMM on a 90-bit fully deceptive problem.  83
41  CH-1S, GA-E, BH-MM and BH-DMM on a 90-bit distributed fully deceptive problem.  83
42  CH-1S, GA-E, BH-MM and BH-DMM on the 4-state busy beaver problem.  83
43  CH-1S, GA-E, BH-MM and BH-DMM on Holland's royal road with k = 4.  83
44  A taxonomy of local search algorithms.  89
45  The landscape of the any ascent operator.  90
46  The landscape of the least ascent operator.  90
47  The landscape of the median ascent operator.  90
48  The landscape of the steepest ascent operator.  90
49  Reverse hillclimbing on a one max problem.  93
50  The components of a simple NK landscape.  99
51  The 15-puzzle.  122
52  The 15-puzzle state space.  123
53  Summary of fitness distance correlation (FDC) results.  136
54  Fitness versus distance on Ackley's one max problem.  139
55  Fitness versus distance on Ackley's two max problem.  139
56  Fitness versus distance on a (12,1) NK landscape.  140
57  Fitness versus distance on a (12,3) NK landscape.  140
58  Fitness versus distance on a (12,11) NK landscape.  140
59  Fitness versus distance on a fully easy problem.  140
60  Fitness versus distance on two copies of a fully easy problem.  140
61  Fitness versus distance on three copies of a fully easy problem.  140
62  Fitness versus distance on Grefenstette's deceptive but easy problem.  141
63  Fitness versus distance on Grefenstette's non-deceptive but hard problem.  141
64  Fitness versus distance on Ackley's porcupine problem.  142
65  Fitness versus distance on Horn & Goldberg's maximum modality problem.  142
66  Fitness versus distance on Ackley's mix problem.  143
67  Fitness versus distance on two copies of a deceptive 3-bit problem.  144
68  Fitness versus distance on three copies of a deceptive 3-bit problem.  144
69  Fitness versus distance on four copies of a deceptive 3-bit problem.  144
70  Fitness versus distance on a deceptive 6-bit problem.  144
71  Fitness versus distance on two copies of a deceptive 6-bit problem.  144
72  Fitness versus distance on three copies of a deceptive 6-bit problem.  144
73  Fitness versus distance on Ackley's trap problem.  145
74  Fitness versus distance on a deceptive 4-bit problem (Whitley's F2).  146
75  Fitness versus distance on two copies of a deceptive 4-bit problem (Whitley's F2).  146
76  Fitness versus distance on three copies of a deceptive 4-bit problem (Whitley's F2).  146
77  Fitness versus distance on a deceptive 4-bit problem (Whitley's F3).  146
78  Fitness versus distance on two copies of a deceptive 4-bit problem (Whitley's F3).  146
79  Fitness versus distance on three copies of a deceptive 4-bit problem (Whitley's F3).  146
80  Fitness versus distance on Holland's royal road on 32 bits.  147
81  Fitness versus distance on Holland's royal road on 128 bits.  147
82  Fitness versus distance on a long path problem.  148
83  Fitness versus distance on the 2-state busy beaver problem.  149
84  Fitness versus distance on the 3-state busy beaver problem.  149
85  Fitness versus distance on the 4-state busy beaver problem.  149
86  Fitness versus distance on a binary coded De Jong's F1.  150
87  Fitness versus distance on a Gray coded De Jong's F1.  150
88  Fitness versus distance on a Tanese function on 16 bits.  151
89  Fitness versus distance on a Tanese function on 32 bits.  151
90  Fitness versus distance on royal road function R1.  152
91  Fitness versus distance on royal road function R2.  152
92  Fitness versus distance on Ackley's plateau problem.  153
93  Fitness versus distance on De Jong's F2 binary coded with 8 bits.  154
94  Fitness versus distance on De Jong's F2 Gray coded with 8 bits.  154
95  Fitness versus distance on De Jong's F2 binary coded with 12 bits.  154
96  Fitness versus distance on De Jong's F2 Gray coded with 12 bits.  154
97  Fitness versus distance on De Jong's F2 binary coded with 16 bits.  156
98  Fitness versus distance on De Jong's F2 Gray coded with 16 bits.  156
99  Fitness versus distance on De Jong's F2 binary coded with 24 bits.  156
100  Fitness versus distance on De Jong's F2 Gray coded with 24 bits.  156
101  Fitness versus distance on De Jong's F3 binary coded with 15 bits.  157
102  Fitness versus distance on De Jong's F3 Gray coded with 15 bits.  157
103  Fitness versus distance on De Jong's F5 binary coded with 12 bits.  158
104  Fitness versus distance on De Jong's F5 Gray coded with 12 bits.  158
105  Fitness versus distance on Liepins and Vose's deceptive problem.  159
106  Fitness versus distance on the transform of Liepins and Vose's deceptive problem.  159
107  A graphical interpretation of six another functions.  188
LIST OF TABLES

Table  Page

19  Reverse hillclimbing from 1000 random 4-state busy beaver peaks.  113
20  Reverse hillclimbing from the 48 4-state busy beaver peaks.  114
21  10,000 hillclimbs on the 4-state busy beaver landscape.  115
22  Expectation and observation on the 4-state busy beaver problem.  116
23  Reverse hillclimbing from 1000 random 3-state busy beaver peaks.  116
24  Reverse hillclimbing from the 40 3-state busy beaver peaks.  117
25  10,000 hillclimbs on the 3-state busy beaver landscape.  118
26  Expectation and observation on the 3-state busy beaver problem.  118
27  Reverse hillclimbing from 1000 random 2-state busy beaver peaks.  119
28  Reverse hillclimbing from the 4 2-state busy beaver peaks.  119
29  10,000 hillclimbs on the 2-state busy beaver landscape.  120
30  Expectation and observation on the 2-state busy beaver problem.  120
31  The languages of evolutionary and heuristic search algorithms.  125
32  Problems studied with fitness distance correlation.  138
33  Several hillclimbers using an another function.  186
34  Various hillclimbers on an NK 16,12 landscape.  187
35  Various hillclimbers on an NK 16,8 landscape.  189
36  Various hillclimbers on an NK 16,4 landscape.  189
37  Various hillclimbers on the 4-state busy beaver problem.  190
38  Various hillclimbers on the 3-state busy beaver problem.  190
39  Various hillclimbers on the 2-state busy beaver problem.  190
40  Standard errors for CH, GA-S, GA-E and BH on the one max problem with 60 bits.  191
41  Standard errors for CH, GA-S, GA-E and BH on the one max problem with 120 bits.  191
42  Standard errors for CH, GA-S, GA-E and BH on the fully easy problem with 10 subproblems.  192
43  Standard errors for CH, GA-S, GA-E and BH on the fully easy problem with 15 subproblems.  192
44  Standard errors for CH, GA-S, GA-E and BH on the fully deceptive problem with 10 subproblems.  192
45  Standard errors for CH, GA-S, GA-E and BH on the fully deceptive problem with 15 subproblems.  193
46  Standard errors for CH, GA-S, GA-E and BH on the distributed fully deceptive problem with 10 subproblems.  193
47  Standard errors for CH, GA-S, GA-E and BH on the distributed fully deceptive problem with 15 subproblems.  193
48  Standard errors for CH, GA-S, GA-E and BH on the busy beaver problem with 3 states.  194
49  Standard errors for CH, GA-S, GA-E and BH on the busy beaver problem with 4 states.  194
50  Standard errors for CH, GA-S, GA-E and BH on Holland's royal road problem with k = 4.  194
51  Standard errors for CH, GA-S, GA-E and BH on Holland's royal road problem with k = 6.  195
52  Standard errors for CH, CH-1S and CH-NJ on the one max problem with 120 bits.  195
53  Standard errors for CH, CH-1S and CH-NJ on the fully easy problem with 15 subproblems.  195
54  Standard errors for CH, CH-1S and CH-NJ on the fully deceptive problem with 15 subproblems.  196
55  Standard errors for CH, CH-1S and CH-NJ on the distributed fully deceptive problem with 15 subproblems.  196
56  Standard errors for CH, CH-1S and CH-NJ on the busy beaver problem with 4 states.  196
57  Standard errors for CH, CH-1S and CH-NJ on Holland's royal road problem with k = 4.  196
58  Standard errors for GA-S and GA-RC on the one max problem with 120 bits.  197
59  Standard errors for GA-S and GA-RC on the fully easy problem with 15 subproblems.  197
60  Standard errors for GA-S and GA-RC on the fully deceptive problem with 15 subproblems.  197
61  Standard errors for GA-S and GA-RC on the distributed fully deceptive problem with 15 subproblems.  197
62  Standard errors for GA-S and GA-RC on the busy beaver problem with 4 states.  198
63  Standard errors for GA-S and GA-RC on Holland's royal road problem with k = 4.  198
64  Standard errors for CH-1S, GA-E, BH-MM and BH-DMM on the one max problem with 120 bits.  198
65  Standard errors for CH-1S, GA-E, BH-MM and BH-DMM on the fully easy problem with 15 subproblems.  199
66  Standard errors for CH-1S, GA-E, BH-MM and BH-DMM on the fully deceptive problem with 15 subproblems.  199
67  Standard errors for CH-1S, GA-E, BH-MM and BH-DMM on the distributed fully deceptive problem with 15 subproblems.  199
68  Standard errors for CH-1S, GA-E, BH-MM and BH-DMM on the busy beaver problem with 4 states.  200
69  Standard errors for CH-1S, GA-E, BH-MM and BH-DMM on Holland's royal road problem with k = 4.  200
LIST OF ALGORITHMS
ABBREVIATIONS
CHAPTER 1
Introduction
The natural world exhibits startling complexity and richness at all scales. Examples
include complex social systems, immune and nervous systems, and the intricate inter-
relationships between species. These are but a few of the wonders that have become
more apparent as we have increased our ability to examine ourselves and the world
around us. Science is an ever-changing system of beliefs that attempts to account for
what we observe, and it has adjusted itself dramatically to accommodate incoming
data. It is remarkable that so much of what we see has come to be explained by a
single theory: the theory of evolution through variation and selection.
Evolution through natural selection has profoundly affected our view of the world. The society into which Charles Darwin delivered what has become known as The Origin of Species in 1859 was primed for this change [1]. At that time, theories of the mutability of species were rumbling through all levels of English society. Such suggestions were in direct confrontation with the teachings of the church and were therefore an intolerable affront to the position of God in the universe. In 1844, Vestiges of the Natural History of Creation [2], published anonymously, had scandalized the church and academia, and fanned the flames of unrest in the lower classes. In those times, science and morality were tightly coupled. Many fields of scientific enquiry
currently enjoy freedom of thought in an atmosphere that owes much to the revolution
wrought by the theory of evolution. It is impossible to fully appreciate the importance
and magnitude of these changes without an understanding of the social context in
which this theory developed.
Darwin, like others who promoted evolution through natural selection, was by
no means infallible. For example, he was unable to demonstrate a mechanism of
inheritance in which variation was maintained. His eventual hypothesis on the subject, the theory of pangenesis, proved incorrect. It was fifty years before the details of heredity began to fall into place, and another thirty before the "evolutionary synthesis" solidified the link between the theory of evolution and the relatively young field of genetics [3, 4]. Nevertheless, Darwin had identified the central mechanism of evolution: selection in the presence of variation, or "descent with modification" as he called it. In many cases, the specifics of evolution through variation and selection are still being brought to light. However, the basic mechanisms have sufficed to explain,
to the satisfaction of many, an incredibly wide range of behaviors and phenomena
observed in the natural world.
It is no wonder then that computer scientists have looked to evolution for inspira-
tion. The possibility that a computational system, endowed with simple mechanisms
of variation and selection, might be persuaded to exhibit an analog of the power
of evolution found in natural systems is an idea with great appeal. This hope has
motivated the development of a number of computational systems modeled on the
principles of natural selection. Some history of these efforts is described by Goldberg [5] and Fogel [6].
A difficulty with the hope that we might build computational systems based on
the principles of natural selection, and put these systems to practical use, is that
natural systems are undirected, whereas virtually all our activity is highly directed.
We use the computer as a tool for solving specific problems we construct, and we
place a lot of emphasis on doing so as quickly as possible and with as little waste as
possible. Natural systems have no such aims or constraints. Survival is not directed
toward some fixed target. Instead, it is a matter of keeping a step ahead, in any accessible direction. It is a gross generalization, but I believe efforts to date can be
grouped into two broad categories:
Useful systems that are modeled loosely on biological principles. These have
been successfully used for tasks such as function optimization, can easily be
described in non-biological language, and are often outperformed by algorithms
whose motivation is not taken from biology.
Systems that are biologically more realistic but which have not proved particu-
larly useful. These are more faithful to biological systems and are less directed
(or not directed at all). They exhibit complex and interesting behaviors but
have yet to be put to much practical use.
Of course, in practice we cannot partition things so neatly. These categories are
merely two poles, between which lie many different computational systems. Closer to the first pole are the evolutionary algorithms, such as Evolutionary Programming, Genetic Algorithms and Evolution Strategies. Closer to the second pole are systems that may be classified as Artificial Life approaches [7]. Both poles are interesting in their own right, and I have spent time exploring aspects of each. I have nothing against evolutionary algorithms that are used to optimize functions or against artificial life systems that are difficult to put to practical use. Yet it seems to me that we are still
rather far from tapping the best of both worlds and the hope of being a part of that
research, when and if it happens, is the ultimate motivation for the current work.
This dissertation is primarily concerned with evolutionary algorithms. A non-
biological perspective on these algorithms is developed and several consequences of
that position are investigated. It is important to remember that this is simply one
perspective on these algorithms. They would not exist without the original biological
motivation, and that ideal has prompted this investigation. I believe that it is im-
portant to maintain the ideal and simultaneously keep careful track of whether real
progress has been made. Evolutionary algorithms are examined through the study of
their relationship to two things:
1. The biological notion of a fitness landscape.
2. Other search algorithms, particularly various hillclimbers and heuristic state
space search algorithms found in artificial intelligence and operations research.
The first of these relationships, the connection between evolutionary algorithms and fitness landscapes, has achieved almost folkloric recognition in the evolutionary
computation community. Practically every researcher in that community is familiar
with the landscape metaphor and does not express discomfort or concern when the
descriptions of the workings of these algorithms employ it extensively, using words
such as "peaks," "valleys" and "ridges." Surprisingly however, it is rare, within the field of evolutionary algorithms, to find an actual definition of a fitness landscape.
This situation has perhaps developed because everyone grasps the imagery immedi-
ately, and the questions that would be asked of a less evocative term are not asked.
This dissertation presents a model of landscapes that is general enough to encom-
pass much of what computer scientists would call search, though the model is not
restricted to either the field or the viewpoint. Evolutionary algorithms are examined from the perspective the model affords.
The second relationship that is addressed is the connection between evolutionary
algorithms and the search algorithms of artificial intelligence and operations research. This relationship is not widely recognized. I believe it is an important one. Its identification allows discussion of evolutionary algorithms in non-biological terms. This
leads to a view of evolutionary algorithms as heuristic state space search algorithms,
and this correspondence is used to throw light on the question of what makes a
problem hard (or easy) for a genetic algorithm.
to reproduce on the basis of selection based on their fitness. In each case, these
algorithms were designed to mimic aspects of evolution through natural selection
with the hope that this would lead to general methods that would prove widely useful
through similar operation. Although the algorithms represent vast simplifications of
the actual evolutionary processes, they have proved widely successful.
resolution, one way or another, may have decidedly non-subtle effects [24]. It is no
surprise then that evolutionary algorithms have not been the subject of wide attention
when it comes to models of computation.
When it is discussed at all, the model of computation used for evolutionary algorithms is usually some variant of a black-box model. See, for example, [5, 25, 26, 27, 28, 29]. In this model, one is presented with a black box whose properties are unknown. The box accepts inputs and each input is rewarded with the value of some mysterious function. The aim of this series of probings is also given, and might be "Find an input which produces a large function value" or possibly "Find an input that produces an output value greater than 100."
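To make this concrete, the following sketch shows one minimal way to capture the black-box setting in code: the searcher may only probe the box, and every probe is counted. This is my own illustration rather than anything defined in this dissertation, and the names in it (BlackBox, evaluate, random_search) are hypothetical.

import random

class BlackBox:
    """A black box: accepts an input, returns the value of a hidden
    function, and counts how many evaluations have been requested."""

    def __init__(self, hidden_function):
        self._f = hidden_function   # invisible to the searcher
        self.evaluations = 0

    def evaluate(self, x):
        self.evaluations += 1
        return self._f(x)

# A searcher that can do nothing but probe the box, here by random search
# over bit strings.
def random_search(box, n_bits, budget):
    best_x, best_value = None, float("-inf")
    for _ in range(budget):
        x = tuple(random.randint(0, 1) for _ in range(n_bits))
        value = box.evaluate(x)
        if value > best_value:
            best_x, best_value = x, value
    return best_x, best_value

# The searcher never learns that the hidden function is simply one max.
box = BlackBox(hidden_function=sum)
best, value = random_search(box, n_bits=20, budget=1000)
print(value, box.evaluations)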
Notice that the possible vagueness of the aim of this enterprise may prevent
the provider of inputs from knowing when, if ever, it has located an input (or set of inputs) that solves the search problem. To be more specific, I will call the input provider the "searcher" and the task to be addressed the "problem." This is in keeping with the subject matter of the rest of this dissertation: search algorithms and problems. The black-box model is more commonly phrased in terms of "functions" and "optimization."
Given such a model, the operation of providing an input and receiving an out-
put, often called performing a function evaluation, becomes the coin of the realm.
Algorithms may be compared on this basis, for example by comparing the number of evaluations required, on average, to achieve a certain performance level, or by computing statistics based on the function values achieved over time [30]. In this
dissertation, function evaluations, or, more simply, evaluations, will form the basis
for the comparison of algorithm performance. I will adopt a form of black-box model,
but before doing so, I wish to argue that without thought, it can be misleading in
two important ways:
1. Representation. The first problem with the abstract black-box model of com-
putation arises from the fact that we rarely encounter black boxes in the real
world. The model assumes that the problem to be solved is already embodied
in some mechanical device and that our only task is to play with the input
knobs of the box. In reality, we encounter problems in the world outside the
computer, and if we intend to attempt to solve them using a computer, we have
to get them into the computer. The black-box model is silent on this aspect of
computational problem-solving even though this step is an inevitable and im-
portant part of the problem solving process. The need to formulate the problem
in a fashion that is suitable for computation requires making a decision about
how to represent the input to the black box. This is a non-trivial problem. But
appears to provide actually has to be constructed by the problem solver. These are
the form of the input, and the output function. In other words, the entire black box.
A naive consideration of the black-box model would take those two as given, whereas
I believe that both of them must always be constructed. This leaves us in a curious
position. The model of computation we had intended to use has seemingly vanished.
There is no black box until the problem solver creates it. The black-box model is not independent of the problem to be solved in the way the comparison-based model of traditional analysis of algorithms is [24]. The comparison-based model is useful for this reason. It is only fair to compare algorithms based on the number of evaluations
they perform if they are using the same input representation and evaluation function.
A graphic illustration of the impact of these choices is provided by comparing
exhaustive search on the alternate representations of the eight queens problem men-
tioned above. One representation has a search space of size 4,426,165,368 while the
other is of size 40,320. Even comparing the same algorithm under a different representation leads to a difference in expected performance of five orders of magnitude. If we were to extend such comparisons to different algorithms employing different representations and fitness functions, our results could be even more wildly misleading.
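For concreteness, the two search-space sizes quoted above are consistent with the two standard representations of the eight queens problem: choosing any 8 of the 64 squares, versus using a permutation that places exactly one queen in each row and column:

    C(64, 8) = 64! / (8! · 56!) = 4,426,165,368        8! = 40,320

and 4,426,165,368 / 40,320 ≈ 1.1 × 10^5, the five orders of magnitude referred to above.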
Our model of computation then shall be based on evaluations, and we can use
it to compare algorithms, provided that the choices of representation and evaluation
function have been made and are the same for the algorithms in question. We will
acknowledge the very important human component present in the construction of a
black box. The process of choosing a representation and choosing an evaluation function requires inference and insight into the problem, and may be more important than
the eventual choice of algorithm to interpret the black box. This encourages caution
when making claims about the virtues of general-purpose, weak methods. These
claims tend to ignore the fact that much of problem solving lies in choices that are
not made by these weak methods.
1.4. Simplicity
Evolutionary algorithms, though often simple to describe and implement, have proved
difficult to understand. Analysis of algorithms (a phrase which, incidentally, has many
wonderful anagrams) typically involves the choice of a model of computation, proofs
of upper bounds on worst and possibly average-case behavior and proofs of lower
bounds on problem difficulty. With luck, better lower bounds will be proved and
better algorithms designed until these bounds meet, to within a constant factor, and
the problem is considered solved. There are many examples of this process and many
1 Quicksort is a well-known example with varying behavior, but the amount of stochasticity in that algorithm is not great enough to make detailed analysis impossible [35].
logical that we should seriously question our understanding of the system, and that
we should continue to ask simple questions.
This dissertation is a collection of very simple questions and my attempts to
answer them. It may appear to some that these simple questions, and the methods
that are used to address them, have little to do with evolutionary algorithms. I would
argue strongly that this is not the case, and that these questions and the methods
used to address them have everything to do with the more complicated algorithms.
For example, it is important to study the operators of these algorithms in isolation
and it is important to study the effect of a population-based search by comparing it
with algorithms that have no population.
Amusingly, my taste for asking simple questions is the result of time spent in
an environment in which the most complex problems are being addressed. It is not
a contradiction to attempt the study of entire complex systems by a form of reduc-
tionism that proceeds by asking a lot of simple questions. I regard the development
of this taste for simple questions to be (finally) a sign of increased maturity. It is the
most important lesson that I will take away from three wonderful years at the Santa
Fe Institute.
1.7. Abbreviations
The remainder of this dissertation uses many abbreviations for common phrases and
algorithm names. I hope that these will not prove too intimidating to those not
already familiar with the various evolutionary algorithms. A complete list of the
abbreviations begins on page xxiv.
CHAPTER 2
A Model of Landscapes
This chapter presents the landscape model that will be used as a framework for the
remainder of the dissertation. It is assumed that one's aim is to address a search
problem. It is possible to adopt more neutral language, to the point that the model
seems completely abstract. However, adopting the language of search does not restrict
applicability too severely. One argument for neutrality is that since the model is
applicable to situations that some people would not consider search, it should be
described in a general way that will permit its easy application to these fields. A
counter argument holds that most of what we do, especially with computers, can be
phrased in terms of search. If the landscape model presented here is useful in some
field of endeavor, then it is probably possible to describe that endeavor as a search.
As will be seen, choosing to view a wide range of situations from a search perspective
is not without precedent.
2.1. Motivation
Given the aim of this chapter, to present a model of landscapes, there are many
preliminary questions that should be addressed before the model is presented. Three
of these are particularly important.
What is a landscape?
A landscape is one way to view some aspects of a complex process. As such,
it is a tool that we hope to use to increase understanding of the process, to
suggest explanations and ideas, and to provide an intuitive feeling for those
aspects of the process. This description is vague because landscapes may be
used in a wide variety of situations. Wright employed the metaphor to provide
a simplified and intuitive picture of his work on the mathematics of gene flow in a population [37]. We will use landscapes to produce a simplified view of some
aspects of search. The metaphor has been used in this way for some time to
describe evolutionary algorithms, though rather informally. As a result, I think
the current use of the metaphor does not always produce beneficial results.
Support for this opinion will be presented shortly.
It is important to understand what I believe a landscape is and how that will
affect the presentation of the model in this chapter. In particular, a landscape
is neither a search algorithm nor a search problem. As a result, to formally
describe a landscape, it is not necessary to formally describe all aspects of all
search algorithms or all aspects of all search problems. For this reason, we
will examine search algorithms and problems of certain kinds and will omit
many details of algorithms and problems. The landscape model will be widely
applicable to search, but it is far from a complete picture of search.
A landscape is a structure that results from some of the choices (by no means
all) that are made when we use a computer to search. It presents only a simplified view of aspects of the search process. A landscape is a tool and it is a
convenience. The model of landscapes presented in this chapter aims to formal-
ize the notion of a landscape and to make the result as general and useful as
possible for thinking about computational search.
Why the interest in landscapes?
My interest in the landscape metaphor stems from an interest in evolutionary
algorithms. In the community of researchers working on these algorithms, the
word "landscape" is frequently encountered, both informally, in conversations
and presentations, and formally, in conference and journal papers. It is natu-
ral therefore that one should seek to understand what is meant by the word.
Remarkably, it proved difficult to find out exactly what people meant by the word, and when that could be discovered, the definitions did not concur. In
my opinion, the landscape metaphor is a powerful one and because its meaning
appears intuitively clear, people are inclined to employ it rather casually.
Questions that would be asked of a less familiar and less evocative word simply
do not get asked. After all, everyone knows what a landscape is. Or do they?
If the term was only used in an informal sense or for the purposes of rough il-
lustration, the situation would be more tolerable. Instead, properties of fitness
landscapes are used as the basis for explanations of algorithm behavior and this
is often done in a careless fashion that is detrimental to our understanding.
Thus, the motivation for trying to establish just what is (or should be) meant
by a landscape comes from (1) the fact that the word is in widespread use, (2)
the belief that it is usually not well-defined, and (3) the belief that this lack of definition has important consequences. I will take the first of these as given and argue below that the two beliefs are justified.
What is wrong with the current definitions of landscapes?
The following are some of the current problems with the use of the landscape
metaphor in the field of evolutionary computation. The use of landscapes in other fields will be discussed in Chapter 6. Some of the terminology used (e.g., operator) will not be defined until later in the chapter.
1. In many cases, landscapes are not defined at all. This is very common
informally, but also happens quite frequently in formal settings. It is also
common for papers to reference an earlier paper that contains no definition
or a poor one. This practice has decreased in the last year.
2. Another common problem is vague definitions of the word. For example, one encounters statements such as "the combination of a search space S and a fitness function f : S → R is a fitness landscape." Apart from the fact that what has been defined here is simply a function, it might not be immediately clear what is wrong with such a definition. The problem is that virtually every term we would like to use in describing a landscape depends on being able to define neighborhood, but the above definition tells us nothing about neighborhood. For example, perhaps the most frequently used landscape-related term is "peak." But what is a peak? A simple definition is: a point whose neighbors are all lower than it is. It is not possible to sensibly define the word "peak" without defining what constitutes neighborhood (see the sketch following this list). The need to be specific about neighborhood is not widely attended to. As a result, in many cases, it is not clear what is meant by the word landscape. The expressions "fitness function" and "fitness landscape" are often used interchangeably, both formally and informally. A fitness function is a function: a mapping from one set of objects to another. There is no notion of neighborhood, and one is not needed. A fitness landscape requires a notion of neighborhood.
3. Even when neighborhood is defined, it is often done incorrectly. Informally, people will staunchly defend the binary hypercube as the landscape when an algorithm processes bit strings, even if the algorithm never employs an operator that always flips exactly one bit chosen uniformly at random in an individual. Certainly this hypercube may have vertices that fit the above definition of a peak, but the relevance of these points to such
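To make the dependence on neighborhood explicit, the following sketch (my own illustration; the function names are hypothetical, not definitions from this dissertation) tests whether a point is a peak relative to a supplied neighborhood function. The same point can be a peak under one operator's neighborhood and not under another's.

def is_peak(x, fitness, neighbors):
    """A point is a peak only relative to a neighborhood: it is a peak
    if no neighbor has strictly greater fitness."""
    return all(fitness(y) <= fitness(x) for y in neighbors(x))

def one_bit_flips(x):
    # Points reachable by flipping exactly one bit.
    return [x[:i] + (1 - x[i],) + x[i + 1:] for i in range(len(x))]

def two_bit_flips(x):
    # Points reachable by flipping exactly two bits.
    return [x[:i] + (1 - x[i],) + x[i + 1:j] + (1 - x[j],) + x[j + 1:]
            for i in range(len(x)) for j in range(i + 1, len(x))]

# A small illustrative fitness function on two-bit strings.
fitness = {(0, 0): 2, (0, 1): 0, (1, 0): 0, (1, 1): 3}.get

x = (0, 0)
print(is_peak(x, fitness, one_bit_flips))  # True: a peak under one-bit flips
print(is_peak(x, fitness, two_bit_flips))  # False: not a peak under two-bit flips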
2.2.1. Multisets
A multiset is
"… a mathematical entity that is like a set, but it is allowed to contain
repeated elements; an object may be an element of a multiset several
times, and its multiplicity of occurrences is relevant."
Knuth [40, page 454].
For example, {1, 1, 4, 5, 5} and {4, 1, 5, 5, 1} are equivalent multisets. In contrast,
{1, 1, 2} and {1, 2} are equivalent as sets, but not as multisets. Of course, any set is
also a multiset. Binary operations ∪, ⊎ and ∩ between multisets that obey commutative,
associative, distributive and other laws can be simply defined [40, page 636],
though we will have no use for them in this dissertation. Knuth mentions a number of
situations in which multisets arise naturally in mathematics. Kanerva makes extensive
use of multisets in his analysis of Sparse Distributed Memory [41]. Given a set S,
I will use M(S) to denote the infinite set of all multisets whose elements are drawn
from S. Thus the multiset above is an element of M({1, 4, 5}), as are the multisets
{1}, {4, 4, 4, 4}, {4, 4} and {1, 4, 5} itself. I will use |S| to represent the number of
elements in the multiset S. Hence, |{1, 5, 5}| = 3. Finally, define

    M_q(S) = {s ∈ M(S) : |s| = q}

as the set of multisets of S with cardinality q.
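To make the multiset notation concrete, here is a minimal sketch of my own (not part of the original text), using Python's collections.Counter as a multiset; M_q(S) is illustrated by enumerating all multisets of a given cardinality.

    from collections import Counter
    from itertools import combinations_with_replacement

    # A multiset can be modeled as a Counter: element -> multiplicity.
    a = Counter([1, 1, 4, 5, 5])
    b = Counter([4, 1, 5, 5, 1])
    assert a == b                             # order is irrelevant, multiplicity matters
    assert Counter([1, 1, 2]) != Counter([1, 2])

    def multisets_of_size(S, q):
        """Enumerate M_q(S): all multisets of cardinality q drawn from the set S."""
        return [Counter(c) for c in combinations_with_replacement(sorted(S), q)]

    # |M_2({1, 4, 5})| = 6: {1,1}, {1,4}, {1,5}, {4,4}, {4,5}, {5,5}
    print(len(multisets_of_size({1, 4, 5}, 2)))   # -> 6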
can be thought of as posing search problems. As a final link in this chain of reasoning,
Rich claims that
If we take these three excerpts seriously, search is not only ubiquitous, but it can be
described in terms of a process on a graph structure. On setting out to develop a
model of fitness landscapes, I soon arrived at the conclusion that a landscape could
be conveniently viewed as a graph. Some time later, it became apparent that this was
a common view of search in other fields and I took pains to ensure that the model
was general enough to be appropriate for discussions of these algorithms also. Given
such a broad target, it is necessary to develop the model in a very general way. I am
particularly concerned that this be done in a mathematically precise way that will
allow formal definitions and discussion when that is needed. This is similar to the
aims expressed by Banerji [45].
In setting out to present a model of computational search, there are several
fundamental issues that need to be initially addressed. I assume, in line with the
quotation due to Pearl above, that a search problem takes the form "Find an object
with the following properties: …" Then search is conducted amongst a (possibly
infinite) set of objects (potential solutions). As was emphasized in §1.3(5), the first
step in solving a search problem involves making a choice about the contents of this
set. I will call this set of objects the object space, and it will be denoted by O. This
choice will often be a simple one and in many cases a good candidate for O will be
provided in the problem statement itself. This decision is typically made before a
computer is introduced as an aid to search. Example object spaces include the set
of n-tuples of real numbers, the set of binary strings of length n, the set of LISP
S-expressions, the set of permutations of the integers one to ten, the set of legal chess
positions, the set of spin configurations in a spin glass and the set of RNA sequences.
A second step that is common when searching is to decide on some representation
of the objects in O. The representation determines a set, R, the representation space,
that can be manipulated more conveniently than O by the searcher. The model of
search of this chapter does not require that R ≠ O, but this will be true throughout
this dissertation. In situations where R = O (i.e., no representation is employed),
one can substitute O for R in the model. When performing search via a computer, R
will very commonly be the result of a choice of data structure, D. In this dissertation,
this will always be the case. The possible ways of assigning values to the bits that
comprise the memory corresponding to an instance of D determine R. Elements of
R will be used to represent elements of O. Representation via instances of a data
structure is not the only possible choice in computational search; it is merely the most
common. Forms of search involving analog computation may have representations
involving resistors (for example), in which case the representation space may have
nothing to do with data structures [46].
Unlike the object space, the representation space is necessarily finite. Computers
have finite limits and therefore, in general, it is not possible to represent the whole of O
with R at any given point in time. Such a situation is very common in evolutionary
algorithms. For this reason, only a finite subset O′ ⊆ O will be represented at
any given time. The mapping between elements of O′ and R will be called the
representation. The representation is the link between the objects we choose to
examine as potential solutions to the search problem and the objects we choose to
manipulate to effect the search. When the search is finished, we invert this mapping
to produce the solution.
A relation, ⇝ ("is represented by"), between the sets O and R implements this
representation. For o ∈ O and r ∈ R, we will write o ⇝ r to indicate that o is
represented by r. The inverse relation ⇝⁻¹ ("represents") between the sets R and
O is defined as ⇝⁻¹ = {(r, o) | (o, r) ∈ ⇝}. We will write r ⇝⁻¹ o to indicate that r
represents o. The (possibly empty) subset of R representing o ∈ O will be denoted
by ⇝(o) and the (possibly empty) subset of O that is represented by r ∈ R will be
denoted by ⇝⁻¹(r). If ⇝(o) ≠ ∅, we will say that o is represented. If ⇝⁻¹(r) = ∅ we
will say that r is illegal. R′ = {r ∈ R | ⇝⁻¹(r) ≠ ∅} will be used to denote the set
of elements of R that are not illegal. How an algorithm restricts its attention to R′
or manages excursions into R − R′ is discussed in §2.12(41). There is nothing to stop
an algorithm from adopting a new O, O′, R, R′, or ⇝ at any point; for example
see [48, 49, 50, 51].
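As a small illustration of this machinery (a sketch of my own, not from the text, using the two-bits-for-three-objects situation mentioned later in §2.12), the dictionary GAMMA below plays the role of the "is represented by" relation, and the string 11 is illegal because it represents nothing.

    # O is an object space, R a representation space of 2-bit strings.
    O = {"red", "green", "blue"}
    R = {"00", "01", "10", "11"}
    GAMMA = {("red", "00"), ("green", "01"), ("blue", "10")}

    def represents(r):
        """The subset of O represented by r (the inverse relation applied to r)."""
        return {o for (o, rr) in GAMMA if rr == r}

    def is_illegal(r):
        return len(represents(r)) == 0

    R_prime = {r for r in R if not is_illegal(r)}   # the legal part of R
    print(sorted(R_prime))        # ['00', '01', '10']
    print(is_illegal("11"))       # True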
At this point, we have almost enough to talk about search. We need to be slightly
more specific about the nature of the search problems in which we are interested. As
mentioned above, we do not need to be too specific about the details of the search
problems if those details do not influence the landscapes that we will construct. In
this dissertation I will be concerned with two types of search problems, which I call
Type 1 and Type 2 problems.
1. A Type 1 search problem requires the location of an object (or objects) possessing
a set P of properties. The search is only satisfied when such an object is
located. The statement of the search problem includes no notion of "closeness"
to a solution. Either a solution has been found and the search is successful, or
one has not been found and it is unsuccessful. It is convenient to imagine the
statement of the search problem as providing a function g : O → {0, 1} defined
as follows:

    g(o) = 1 if o possesses all properties in P, and 0 otherwise,

which can be used to determine if an object satisfies the requirements of the
search problem.
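As a small illustration of such an indicator g (my own sketch, not from the text), take O to be the permutations of 0..n−1 read as "the queen in column i sits in row o[i]", and P the property that no two queens attack each other; permutations already rule out shared rows and columns, so only diagonals need checking.

    def g(o):
        """1 if the queen placement o satisfies all properties in P, else 0."""
        n = len(o)
        for i in range(n):
            for j in range(i + 1, n):
                if abs(o[i] - o[j]) == j - i:   # two queens on the same diagonal
                    return 0
        return 1

    print(g((1, 3, 0, 2)))   # 1: a solution to 4-queens
    print(g((0, 1, 2, 3)))   # 0: queens on the main diagonal attack each other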
Our search algorithms represent potential solutions to the problem using instances
of the data structure D. At point t in time, the algorithm will have memory
allocated for a finite set of these, and the values they contain will form a multiset,
C_t ∈ M(R). When we are not concerned with time, we will drop the subscript and
simply use C to denote the current multiset of R. In the discussion of landscapes to
come, we will be concerned with C and with the methods used by the algorithm to
change it. The algorithms employ operators to modify C. In this dissertation, lower
case Greek symbols will be used exclusively to refer to operators. In particular, φ
(and less frequently ψ) will be reserved to indicate generic operators.
An operator is a function φ : M(R) × M(R) → [0, 1]. The value of φ(v, w) = p
for v, w ∈ M(R) indicates that with probability p, v is transformed into w by a single
application of the stochastic procedure represented by the operator φ. Thus, for any
v ∈ M(R), the values φ(v, w) over all w ∈ M(R) form a probability distribution.
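As a deliberately tiny illustration of this definition (a sketch of mine, not the dissertation's code), here is the single-bit-flip operator on R = {0,1}³, written directly as a function φ(v, w) on singleton multisets, each written simply as a string.

    from itertools import product

    R = ["".join(bits) for bits in product("01", repeat=3)]   # {0,1}^3

    def phi(v, w):
        """Probability that flipping one uniformly chosen bit of v yields w."""
        differing = sum(a != b for a, b in zip(v, w))
        return 1.0 / len(v) if differing == 1 else 0.0

    # phi(v, .) is a probability distribution: outgoing probabilities sum to 1.
    v = "010"
    assert abs(sum(phi(v, w) for w in R) - 1.0) < 1e-12
    print([w for w in R if phi(v, w) > 0])   # ['000', '011', '110'] -- the neighbors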
The details of the workings of an operator have deliberately not been specified.
As a result, operators are so broad in scope that we may regard every method an
algorithm uses to affect C as taking place through the action of an operator. Thus
there are operators that (1) increase the size of C through the allocation of memory
to serve as instances of D, (2) reduce the size of C through freeing allocated memory,
and (3) alter the contents of existing members of C. Some operators may combine
any of these three actions. Designating some action of the algorithm as being the
result of an operator can be done quite arbitrarily. For instance, if we examine C_t and
later examine C_{t+τ}, we can imagine that the algorithm has employed some operator
that transformed C_t into C_{t+τ} and that took τ time units to complete. Depending on the
value of τ, we may be talking of the effect of a single microcoded instruction in the
CPU or of the action of the entire algorithm.
Clearly, some operators will have more effect on the result of the search than
others. Though crucial for the algorithm's operation, an operator that increases the
size of C by one (through memory allocation) will typically be uninteresting, as few
algorithms deliberately make use of uninitialized memory. An initialization operator
that sets the bits of an element of C may be somewhat more interesting. What we
choose to consider an operator will depend on what we are interested in studying.
In this dissertation, we will be concerned with operators such as mutation, crossover
and selection in evolutionary algorithms. These represent one natural perspective on
the working of these algorithms and reflect an interest in the importance of these
operators. These operators are natural, in an informal sense, since they are typically
implemented as separate procedures in the programs that implement the algorithms.
Another natural perspective is to consider a generation of an evolutionary algorithm
to be an operator. Such an operator usually includes some combination of the three
finer-grained operators just mentioned. This is the perspective that has (so far) been
adopted by researchers interested in understanding aspects of the genetic algorithm
through viewing it as a Markov chain (see [52] for example). The set of operators of
interest in an algorithm will typically be only a part of the algorithm. The algorithm
must also make important decisions about which operators to apply and when, decide
what elements of C should be acted on by operators, and decide when the search
should be terminated.
Given an operator φ, the φ-neighborhood of v ∈ M(R), which we will denote
by N_φ(v), is the set of elements of M(R) accessible from v via a single use of the
operator. That is, N_φ(v) = {w ∈ M(R) | φ(v, w) > 0}. If w ∈ N_φ(v) we will say
that w is a φ-neighbor of v. When the operator in question is understood, the terms
"neighborhood" and "neighbor" will sometimes be used. However, it is important
to keep in mind that this is an abbreviation. Two points that are neighbors under
one operator may not be under another. Given P ⊆ M(R), we will use N_φ(P) to
represent the set of elements of M(R) − P that neighbor an element of P. That is,

    N_φ(P) = {w ∈ M(R) − P | φ(v, w) > 0 for some v ∈ P}.
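Both neighborhoods can be computed directly from φ. The following sketch (mine, repeating the bit-flip operator of the previous illustration) does exactly that:

    from itertools import product

    R = ["".join(bits) for bits in product("01", repeat=3)]

    def phi(v, w):
        """Single-bit-flip operator probability, as in the previous sketch."""
        diff = sum(a != b for a, b in zip(v, w))
        return 1.0 / len(v) if diff == 1 else 0.0

    def neighborhood(phi, v, space):
        """N_phi(v): every w reachable from v by one application of phi."""
        return {w for w in space if phi(v, w) > 0}

    def set_neighborhood(phi, P, space):
        """N_phi(P): elements outside P that neighbor some element of P."""
        return {w for w in space if w not in P and any(phi(v, w) > 0 for v in P)}

    print(sorted(neighborhood(phi, "010", R)))               # ['000', '011', '110']
    print(sorted(set_neighborhood(phi, {"000", "010"}, R)))  # ['001', '011', '100', '110']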
A situation that we will often encounter arises when an operator replaces an
element v ∈ C by choosing one of a subset of N_ψ(v). That is, the job of φ is to
replace v with one member of the neighbors of v as defined by a second operator ψ.
An example will make this clearer. Suppose we have a hillclimbing algorithm, that
R is the set of all binary strings of length n, and that v ∈ R is the current location
of the hillclimbing search. Operator ψ flips a randomly chosen bit in a binary string.
Operator φ uses ψ some number of times to generate a subset of N_ψ(v). These are
placed by φ into temporary locations in C. When some number of neighbors of v
under ψ have been generated, or when one is found that is suitable, φ replaces v with
the selected neighbor. Another example of this sort of operator is the one that effects
a move in a simulated annealing algorithm [53]. It employs a mutation operator to
generate neighbors of the current point until it finds an acceptable one and sets the
memory corresponding to the current point to the value of the selected neighbor.
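A sketch of this two-operator structure (my own illustration, under the assumptions of the bit-string example above): ψ proposes neighbors and φ commits one of them according to a simple acceptance rule.

    import random

    def psi(v):
        """Proposal operator: flip one uniformly chosen bit of the string v."""
        i = random.randrange(len(v))
        return v[:i] + ("1" if v[i] == "0" else "0") + v[i + 1:]

    def phi_step(v, fitness, max_tries=20):
        """Commit operator: sample psi-neighbors until one is at least as fit,
        then replace v with it (or give up after max_tries proposals)."""
        for _ in range(max_tries):
            w = psi(v)
            if fitness(w) >= fitness(v):
                return w
        return v

    ones = lambda s: s.count("1")          # a toy fitness: the one max function
    v = "0010"
    for _ in range(10):
        v = phi_step(v, ones)
    print(v)   # very likely '1111' after a few accepted steps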
This brings us to the final aspect of search algorithms that we will consider
before describing landscapes. When an algorithm makes a decision about the relative
worth of a set of multisets of R (such as the neighbors generated in the example
above), it must have some basis for this decision. Although the decision can be taken
by simple use of a pseudo-random number generator, it is more common that the
algorithm will have some way of computing the worth of an element of M(R). We
will suppose the search algorithm has a function f : M(R) → F, for some set F, and
a partial order >_F over F. If v, w ∈ M(R) and f(v) >_F f(w) then the multiset v
will be considered in some sense better for the purposes of continuing the search
than w.
2.4. Landscapes
A landscape is dependent on five of the components of search and algorithms discussed
in the previous section. We may write a landscape as

    L = (R, φ, f, F, >_F).

The components are, respectively, the representation space, an operator, the function
f : M(R) → F for some set F, and a partial order >_F over F. As emphasized
in earlier sections, a landscape is a metaphor by which we hope to imagine some
aspect of the behavior of an algorithm. That can be done by viewing the 5-tuple
as defining a directed, labeled graph G_L = (V, E) where V ⊆ M(R), E ⊆ V × V
and (v, w) ∈ E if and only if φ(v, w) > 0. A vertex v ∈ V will be labeled with f(v). An
edge (v, w) will be labeled with φ(v, w), the probability that the action associated
with the operator produces w from v. Though a landscape is formally defined by a
5-tuple, we will also talk of a landscape as though it were the graph that arises from
the 5-tuple.
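The graph view is easy to make concrete. The following sketch (mine, again using the toy bit-flip operator and the one max fitness on singleton vertices) builds G_L, labeling vertices with f and edges with φ:

    from itertools import product

    R = ["".join(bits) for bits in product("01", repeat=3)]      # singleton vertices
    ones = lambda s: s.count("1")                                  # a toy fitness
    phi = lambda v, w: (1 / 3 if sum(a != b for a, b in zip(v, w)) == 1 else 0.0)

    def landscape_graph(space, phi, f):
        """Build G_L = (V, E) with vertex labels f(v) and edge labels phi(v, w)."""
        V = {v: f(v) for v in space}
        E = {(v, w): phi(v, w) for v in space for w in space if phi(v, w) > 0}
        return V, E

    V, E = landscape_graph(R, phi, ones)
    print(V["010"])            # vertex label: fitness 1
    print(E["010", "011"])     # edge label: transition probability 1/3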
The label f(v) attached to a vertex v can be thought of as giving the "height" of
that vertex. This is in keeping with the imagery we usually associate with landscapes.
This value will often be referred to as the fitness of the multiset of R represented
by v. The partial order, >_F, is used to determine relative fitness (height). In many
cases, >_F will actually be a total order. The out-degree of a vertex in a landscape
graph will be the same as the size of the neighborhood of the corresponding multiset.
When φ(v, w) = φ(w, v) for all v, w ∈ V, we will consider the landscape graph as
undirected and draw a single edge between v and w with the understanding that the
edge is bidirected. It should be remembered that the vertices of the landscape graph
correspond to multisets of elements from R, not to single elements of R. This is
important, because it allows us to define landscape graphs for arbitrary operators,
not just those that act on and produce a single element of R (e.g., mutation in a
genetic algorithm). Landscapes in this model are well-defined no matter how many
elements of R the operator acts on or produces, even zero. It will also be important to
remember that each operator employed by a search algorithm helps create a landscape
graph. Thus if an algorithm employs three operators, it can be thought of as traversing
edges on three graphs.
2.5.3. global-maximum
A global-maximum or global-optimum of a landscape is a vertex v ∈ V such that

    f(v) ≥_F f(w) for all w ∈ V.

That is, a vertex is a global maximum if it is at least as fit as every other vertex. If
a vertex has maximal fitness, then it will be a global maximum no matter who its
neighbors are. For this reason, we can call a vertex a "global maximum" without an
operator prefix. If a vertex is a global maximum under one operator, then it will also
be one under all other operators.
2.5.4. φ-local-maximum
A φ-local-maximum or φ-local-optimum is a φ-peak that is not a global maximum.
In this case, unlike with global maxima, the prefix is important. A vertex v that is a
local maximum under one operator need not be a local maximum under another.
2.5.5. φ-plateau
A φ-plateau is a set of vertices M ⊆ V with |M| > 1 such that for all v_0, v_n ∈ M
there exist v_1, …, v_{n−1} with

    f(v_i) = f(v_{i+1}) and v_{i+1} ∈ N_φ(v_i) for all 0 ≤ i < n.

This is a connected set of at least two vertices that all have the same fitness. It is
possible to move using φ between any two vertices visiting only vertices with equal
fitness. Our usual three-dimensional image of a plateau involves a flat area in which,
for the most part, a step in any direction will not result in a drop in altitude. This
is not the case in the above definition, and it is important to keep this in mind. For
example, a φ-plateau may surround areas of exceptionally low fitness, as does the rim
of a volcano. In such cases, there may be no points on the plateau that have the same
fitness as all their neighbors. Defining a plateau to better fit our intuitions is difficult
and, besides, what we will most often be dealing with, at least in the chapters that
follow, will be a φ-plateau as defined here.
2.5.6. φ-mesa
A φ-mesa is a φ-plateau, P, whose points have fitness k such that

    f(v) <_F k for all v ∈ N_φ(P).

That is, a φ-mesa is a φ-plateau with the additional property that no point on the
plateau has a neighbor of higher fitness. Such a set is a connected region of the
landscape in which an algorithm that only moves to vertices of equal or higher fitness
may wander indefinitely without making any improvement. A φ-mesa of size one is
also a φ-peak.
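Plateaus and mesas are simple to identify mechanically on a small landscape. The sketch below (mine, not the dissertation's code; it assumes a symmetric operator and a totally ordered fitness) flood-fills connected equal-fitness components and checks which of them are mesas:

    from itertools import product

    R = ["".join(b) for b in product("01", repeat=3)]
    flips = lambda v: {v[:i] + ("1" if v[i] == "0" else "0") + v[i + 1:] for i in range(len(v))}

    def plateaus_and_mesas(space, neighbors, f):
        """Connected equal-fitness components of size > 1 (plateaus), and those
        with no fitter outside neighbor (mesas)."""
        unseen, plateaus, mesas = set(space), [], []
        while unseen:
            start = unseen.pop()
            comp, frontier = {start}, [start]
            while frontier:                               # flood fill at constant fitness
                v = frontier.pop()
                for w in neighbors(v):
                    if w in unseen and f(w) == f(v):
                        unseen.remove(w)
                        comp.add(w)
                        frontier.append(w)
            if len(comp) > 1:
                plateaus.append(comp)
                outside = {w for v in comp for w in neighbors(v)} - comp
                if all(f(w) < f(start) for w in outside):
                    mesas.append(comp)
        return plateaus, mesas

    halves = lambda s: s.count("1") // 2                  # a toy fitness with ties
    p, m = plateaus_and_mesas(R, flips, halves)
    print(len(p), len(m))                                 # 2 plateaus, 1 of them a mesa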
2.5.7. φ-saddle-region
A saddle point is a notion that is usually associated with continuous spaces. It is not
immediately clear how to define a saddle point in a high-dimensional discrete space.
If it really is to be a point and it is a saddle in the normal sense, then one definition
would simply require the point to have at least one uphill neighbor and at least one
downhill neighbor. As this will be true of most points, the flavor of a saddle point in
a real space is lost. The term becomes practically useless since the only points that
are not then saddle points are peaks (maximal or minimal).
A better definition is to use φ-saddle-region to refer to a φ-plateau that is not a
φ-mesa. This is a region rather than a point. This definition also loses an important
part of the flavor of a saddle point, but at least it is useful and the name change
highlights the loss.
2.5.8. φ-basin-of-attraction
The φ-basin-of-attraction of a vertex v_n is the set of vertices

    B_φ(v_n) = {v_0 ∈ V | ∃ v_1, …, v_{n−1} with v_{i+1} ∈ N_φ(v_i) for all 0 ≤ i < n}.

Thus the basin of attraction of a vertex v is the set of vertices from which v may be
reached using φ. Notice that w ∈ B_φ(v) does not imply v ∈ B_φ(w).
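A basin of attraction is just reverse reachability in G_L, so it can be computed with a backwards breadth-first search. Another sketch of mine over the toy bit-flip landscape:

    from collections import deque
    from itertools import product

    R = ["".join(b) for b in product("01", repeat=3)]
    phi = lambda v, w: (1 / 3 if sum(a != b for a, b in zip(v, w)) == 1 else 0.0)

    def basin_of_attraction(target, space, phi):
        """All vertices from which target can be reached along edges with phi > 0."""
        basin, queue = {target}, deque([target])
        while queue:
            w = queue.popleft()
            for v in space:                      # predecessors of w
                if v not in basin and phi(v, w) > 0:
                    basin.add(v)
                    queue.append(v)
        return basin

    print(len(basin_of_attraction("111", R, phi)))   # 8: under bit flips, every vertex

For a symmetric, connected operator the basin is (as here) the whole space; basins become informative for operators that only accept uphill moves.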
2.5.12. φ-connected-components
If an operator φ is symmetric, we will often talk about the φ-connected-components of
a landscape graph. Two vertices v, w ∈ V are φ-connected in G_L if there is at least
one path between them (irrespective of edge directions). Following the definitions of
connectedness in §2.2.2(18) leads naturally to φ-connected landscapes. The operator
associated with a connected landscape is necessarily walkable, but the converse is not
true, as will be demonstrated in Chapter 3. If

    φ(v, w) = φ(v, x) for all v ∈ V and all w, x ∈ N_φ(v),

we will usually not label the edges of G_L with the transition probabilities. Figure 1
illustrates a situation where transition probabilities are not equal.
[Figure 1: a landscape fragment on pairs of binary strings (e.g., (0000, 1011)) whose edges carry unequal transition probabilities of 1/3 and 2/3.]
Figure 2. The landscape for the bit-flipping operator on binary strings of length
three. Edges are bidirectional and each has probability one-third.
Figure 3. The mutation landscape for binary strings of length three. The mutation
probability is p, and q = 1 − p. Some edge probabilities are omitted. Edges are
bidirectional.
The model presented above has a number of consequences that are not found in other
landscape models. Two of these may seem particularly strange. First, as described
above, a landscape may not be walkable. As mentioned in §2.5(26), an example is the
landscape induced by any form of crossover that produces one child from two parents.
Such an operator cannot be used to conduct a walk on the landscape as the output
of the operator cannot be used as the next input.
Second, landscapes may not be connected. A simple example is the landscape
induced by a (non-GP) crossover operator that produces two children from two parents.
Consider a vertex of the landscape that corresponds to two points of R that
are identical. The vertex will be connected to itself (with probability 1), and nothing
else. A more specific and less trivial example of a landscape that is not connected is
seen by considering the vertex (011, 010) of V where R = {0, 1}³. Clearly no form
of crossover can transform this input to a pair of points either of which begins with
a one, e.g., (100, 000). Equally clearly, no composition of crossovers can accomplish
this either. Thus (011, 010) is not connected to any vertex in the landscape which
contains a member of R that starts with a 1 (in fact, this vertex is also connected
only to itself). Several complete crossover landscapes are presented in Chapter 3.
The possibility that landscapes may not be connected means that on these landscapes
there is no general notion of distance between landscape vertices. This does
not mean that a metric cannot be defined, just that there may be no natural one (such
as the length of the shortest path between the vertices concerned). The model does
not require a distance metric to exist, though the absence of one might make certain
statistics meaningless or impossible to compute. Within a connected component of
a landscape, one can always use the length of the shortest path between two vertices
as a distance metric. The subject of distance metrics and dimensionality is addressed
in §2.9(39).
Because a landscape may not be walkable, general statistics describing properties
of landscapes may be restricted to using the information gained from repeated single
applications of the operator that generated the landscape. This sort of statistic was
employed by Manderick et al., even though they were considering a walkable landscape
[60].
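One such single-application statistic is the correlation between the fitness of a randomly chosen point and the fitness of one offspring produced from it, in the spirit of the operator-correlation measurements of Manderick et al. [60]. A rough sketch of my own, over the toy bit-flip example used earlier:

    import random
    from itertools import product

    R = ["".join(b) for b in product("01", repeat=3)]
    ones = lambda s: s.count("1")

    def flip_one(v):
        i = random.randrange(len(v))
        return v[:i] + ("1" if v[i] == "0" else "0") + v[i + 1:]

    def operator_fitness_correlation(points, step, f, trials=10000):
        """Estimate corr(f(parent), f(offspring)) from repeated single applications."""
        xs, ys = [], []
        for _ in range(trials):
            v = random.choice(points)
            w = step(v)
            xs.append(f(v))
            ys.append(f(w))
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
        sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
        sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
        return cov / (sx * sy)

    print(operator_fitness_correlation(R, flip_one, ones))   # about 1/3 for this toy case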
a step is taken there before the cycle repeats. This process is depicted in Figure 5.
Under this view, the GA is taking single steps on the various landscapes but cannot
be said to be walking on any of them. In some cases, obviously, a point may survive
many such cycles and take several steps on the mutation landscape. In other cases
the population might be highly converged, in which case crossover and mutation will
be producing little change and it could be argued that some or even most of the
population is walking on the mutation landscape or that the entire population is
walking on the selection landscape. But these cases are the exception rather than the
norm.
[Figure 5: a population of individuals moves on the selection landscape, is paired for crossover (some pairs moving on the crossover landscape), and some individuals then move under mutation on the mutation landscape before the cycle repeats.]
The model establishes a point of contact with the search algorithms of AI and
OR. In all cases the underlying structures being searched are graphs. This
relationship is explored more fully in Chapter 5.
The model does not make the assumptions that prevent other notions of landscape
from being more widely used. For example, the landscape does not need
some fixed dimensionality, nor does it need a distance metric between the objects
that compose the landscape. For these reasons, it is possible to view many
AI problems as being problems on landscapes of this type. For instance, one
could conceive of a landscape for chess or Rubik's cube. The navigational task
on these landscapes may differ, or the operators may be complex, but the underlying
structures are the same. The model is also useful within the field of
evolutionary computing. It applies as well to GP as to GAs. Statistics that can
be calculated for a landscape in one paradigm can be calculated in exactly the
same way for another. The model also provides a framework for thinking about
HC, SA, ESs and EP.
The model invites a point of view that seems uncommon in the field of evolutionary
computation, though not in AI. This is a view of search as navigation
and structure. Once we view search in this way and identify the various components
present in a GA, it is natural to ask questions about them. This division
and the recombination it makes possible are examined in detail in Chapter 3.
The "one operator, one landscape" view reveals the very different landscapes
that are constructed by various operators. This invites statistical analysis of
the landscapes, as will be seen in Chapter 5 and as has been done in [60, 61, 62,
63, 64]. Such analysis has the advantage of being independent of any particular
navigation strategy. For this reason, it may be possible to demonstrate that a
particular operator creates difficult (in some sense) landscapes for some problem
types. Such results might go a long way towards resolving debates on the virtues
of certain operators. Statistics such as these would be very useful as indicators
of potential difficulty (or ease) of a problem for an operator.
It can be argued that the model is not particularly useful in situations where the
nature of the operators changes in the course of search: for example, the change
of mutation vector for an individual in ES and EP, the change of edge probabilities
when the temperature falls in simulated annealing or after inversion in a GA, or where
the mapping between objects and representation space is changed, e.g., in dynamic
parameter encoding [49], delta coding and delta folding [48]. There is some truth
to this argument. However, if one ceases to regard a search structure as something
necessarily fixed for all time, the model is still potentially useful. For example, a
statistic, say correlation length, might be computed for the landscapes generated by
a range of different temperature settings in a simulated annealing problem. This
might provide useful information about when the search could be expected to make
good progress (thus guiding the choice of cooling schedule) and it might prompt a
comparison with a hill climber or other algorithm on one of the landscapes. These
situations make the landscape something of a moving target, but the targets can be
studied individually.
More generally, an algorithm might change many aspects of its behavior, for
example the fitness function or the representation space, mid-run, and thereby shift its
attention to new landscapes. This might be done very frequently. In these cases, the
landscapes model of this dissertation holds that each such change produces potentially
new landscape structures and that these can and should be studied independently.
This is not a claim that the family of landscapes used by an algorithm should not be
studied as a whole, just that it is possible to study the components in isolation, that
this will be a simpler task and that it is worthwhile.
The fact that the model does not require a distance metric is not a limitation.
There is nothing in the model that prevents the definition of a distance metric on R
or V, and if this proves useful, it should be taken advantage of. In cases where the
landscape is fully connected, there is always a natural definition of distance, and this
can be used to compute such things as correlation lengths [65]. In other cases there
may be no useful definition of distance. That the model does not provide one is not
a shortcoming. It cannot be denied that the algorithm is making moves on the graph
defined by the operator. A landscape model that provides a concrete, well-defined
graph which can be studied is far better than nothing.
that violate constraints. These situations are very common in evolutionary algorithms,
and a number of ways of dealing with them have been adopted. They arise in
even the simplest GA applications (for example, using 2 bits to represent 3 objects),
in algorithms that manipulate permutations (where crossover can easily produce a
non-permutation), and with floating point representations. There have been several
approaches to dealing with these problems:
Probably the most common solution is to build special operators that produce
legal representations from other legal representations. This is a common approach
when manipulating permutations of integers. Operators for this include
Cycle Crossover [58], Order Crossover [58], Partially Matched Crossover [68],
Edge Recombination [59, 69, 70], the crossover of Gorges-Schleuter [71], Maximal
Preservative Crossover [72], Strategic Edge Recombination [73], and Generalized
N-point Crossover [67]. Michalewicz describes special-purpose operators
designed to stay within a feasible region of R^n given by linear constraints [74].
Another solution is to allow these illegal representations but to penalize them
somehow to encourage the algorithm to avoid these regions of the space, possibly
with increasing probability over time [5, 54, 75, 76, 77, 78, 79].
Davis and Steenstrup [80] suggest that the problem can be dealt with by "decoding"
illegal individuals before evaluating them. They do not give an example,
but claim the procedure is often computationally intensive.
Another approach is to allow the operators to construct illegal representations
but to repair the result [81, 82], a procedure called "forcing" by Nakano et al.
[83].
If the size of the subset of R that is legal is not too small, a solution is to
generate individuals repeatedly until a legal one is found.
A recent elegant solution, proposed by Bean, adopts a new representation of a
permutation that allows traditional operators to be used [84] (a sketch of this
style of encoding follows below).
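The following is a minimal sketch in the spirit of Bean's random-keys idea [84], my own paraphrase rather than a faithful reproduction: a permutation is encoded as a vector of real-valued keys, any crossover or mutation of real vectors yields another legal vector, and the permutation is recovered by sorting the keys.

    import random

    def decode(keys):
        """Interpret a vector of real-valued keys as the permutation that sorts it."""
        return sorted(range(len(keys)), key=lambda i: keys[i])

    def uniform_crossover(k1, k2):
        """Any recombination of two key vectors is itself a legal key vector."""
        return [random.choice(pair) for pair in zip(k1, k2)]

    parent1 = [0.42, 0.91, 0.13, 0.77]
    parent2 = [0.88, 0.05, 0.60, 0.31]
    child = uniform_crossover(parent1, parent2)

    print(decode(parent1))   # [2, 0, 3, 1]
    print(decode(child))     # always a valid permutation of 0..3, whatever the mix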
This dissertation does not attempt to treat representational issues such as the
above, though this aspect of search is as important as any other (see §2.13(42)).
as possible solutions to the search, was illustrated in §1.3(5) with the eight queens
problem. This is by no means an isolated example. This choice might not seem part
of the search algorithm, but it is simple to view it as such. Perhaps the reason why
this aspect of search is not so considered is that we typically have no idea how such a
choice could be performed by a machine. The choice requires experience, insight and
creativity. It is possible for an algorithm to make this choice and to explore various
answers during the course of a search, but currently we do not know how the process
works well enough to design an algorithm to perform it. If this component of search
eventually falls into the domain of the computation, artificial intelligence will have
taken a significant step.
The choice of the representation space, R, has similar importance and is also
typically made by the programmer, not the program. An important issue is that of
what I call over-representation and under-representation. Over-representation occurs
when many elements of R are interpreted (according to ⇝⁻¹) as corresponding to the
same element of O. From a problem solving point of view, such a choice is strictly
redundant, but it may have advantages that we do not yet fully appreciate [66, 67].
The mapping between RNA primary and secondary structure is highly redundant in
this sense. GP has a similar flavor: every S-expression is a member of an infinitely
large class of S-expressions that are all functionally equivalent.
Under-representation is very common in problem solving. A dramatic and elegant
example is Kanerva's Sparse Distributed Memory, in which an extremely large address
space (e.g., of 2^1000 locations) is represented in a conventional-sized memory [41].
As mentioned earlier, an object space that involves an infinite set must be under-represented
at any one point in time. Genetic algorithms have traditionally under-represented
real intervals via a discretization of the interval indexed by the binary
value of a bit string. Using a floating point representation also under-represents the
reals, but in a far less drastic manner. Work on issues of representation, particularly
changing representation, can be found in [47, 48, 49, 50, 51, 85].
A representational choice that is common in AI and OR has the elements of R
represent partial objects from O. A partial object can also be viewed as representing
the class of objects that in some sense contain the partial object. For example, a
sub-tour in a graph can be thought of as a partial tour, but also as a representative
of all those complete tours that include the sub-tour. This form of representation
is the basis of the "split-and-prune" [31] paradigm of OR and it underlies all of the
many variants of the branch-and-bound algorithm. A schema in a GA is also representative
of a class of objects (in this case binary strings), though the GA does not
explicitly manipulate schemata.³
Figure 6. The landscape generated by a perfect operator. Every point in the space,
except the global optimum, is connected directly to the global optimum. Vertices
are labeled with fitness values. Any search strategy using this operator will find the
global optimum quite rapidly.
harder to construct would result in a difficult problem. This operator connects each
vertex directly to the vertex that has the next highest fitness. This is essentially
the landscape that was produced by Horn et al. in [86], though that landscape was
produced by combining a very specific fitness function with an ordinary operator
and this is produced by combining a very specific operator with an ordinary fitness
function. An example is shown in Figure 7. The importance of the choice of fitness
function is discussed at length in the second half of Chapter 5.
³ This is not strictly true if a binary string with no don't care symbols is regarded as a schema,
which it is sensible to do. A more accurate statement would assert that schemata of order greater
than zero are not explicitly manipulated by the GA.
[Figure 7: the landscape generated by an operator that connects each vertex directly to the vertex with the next highest fitness, forming a single chain that ends at the global maximum (fitness 100). Vertices are labeled with fitness values.]
The point of these remarks and illustrations is simply to emphasize that in all
cases these choices have to be made and that the choices that are made may have a
tremendous impact on the difficulty of a problem. I am not suggesting that, given a
search problem, we should search for the perfect operator or fitness function. In every
case, the choices that are made determine a landscape as described by the current
model. This suggests instead that we should try to develop techniques to study these
structures, and attempt to apply what we learn to the problem of deciding whether
one choice in their construction appears better than another. The landscape model
that underlies these structures is in all cases a graph, so a technique developed to
study an aspect of one landscape can be automatically applied to many others.
enhance our understanding of some process, to develop new ideas for exploring spaces
and to stimulate questions about processes operating on these structures. All of this
tends to rely rather heavily on the simple properties that we see in physical three-dimensional
landscapes. It is not clear just how many of the ideas scale up to landscapes
with tens or thousands of dimensions. It is quite possible that the simplicity
and beauty of the metaphor is actually damaging in some instances, for example by
diverting attention from the actual process or by suggesting appealing, simple and
incorrect explanations. Many of these potential problems are summarized by Provine
[87, pp. 307-317], which should be required reading for people interested in employing
the metaphor. Wright's response to Provine's criticism is that his landscape diagrams
were intended as a convenient, but simplistic, representation of a complex process in
a high-dimensional space [37], and were never intended for mathematical use.
As outlined in §2.1, similar problems exist in evolutionary computation. Given
this, it is worth asking whether it is better to abandon the term or to use it and try
to be more precise about what is actually meant. There is something to be said for
abandoning it. After all, in virtually every formulation, a landscape can be regarded
as a graph. On the other hand, it seems unlikely that the term will just go away. In
addition, the metaphor, however distant it may sometimes be from reality, has given
rise to new ideas and intuitions. I have chosen to adopt the term, with the hope that
it will lessen, rather than increase, the vagueness with which it is applied.
2.15. Conclusion
This chapter presented a general model of landscapes and an overview of its consequences,
advantages, limitations and relevance to evolutionary algorithms. The model
views a landscape as a directed graph whose edges and vertices are labeled. It was
argued that the operators in evolutionary algorithms each generate a landscape, that
these landscapes have differing qualities, and that each can and should be studied
in its own right. Thus most evolutionary algorithms are seen as operating on multiple
landscapes. Defining a landscape as a graph establishes a contact with search
algorithms from artificial intelligence and operations research, many of which are explicitly
designed to search labeled graphs. The model advocates a view of search as
composed of navigation and structure, with the structure provided by landscapes. It
is argued that the statistical properties of landscapes can be studied independently
of navigation strategies. The relationship between this model and other work on
landscapes is dealt with in detail in Chapter 6. Most of the issues touched on in this
chapter will be encountered in the chapters that follow.
CHAPTER 3
mechanics of crossover over and above what it could be gaining from the mechanics
alone. In some cases, when well-defined building blocks are not present, the GA
may actually perform worse with normal crossover than a GA with random crossover
because of its use of a population. This test gives an indication of the existence of
building blocks that are exploitable by a GA with a given crossover operator.
rithm that operates on binary strings using the bit-flipping operator described in
§2.6.1(31). Each of the vertices in its landscape graph corresponds to a binary
string, and each has an associated fitness. If we suppose that the algorithm starts
operation at a randomly chosen vertex, v, to which vertex of N(v) does it move next?
This decision is made by the navigation strategy. Notice that we have assumed the
algorithm has made the choice to use the operator to generate a new binary string.
This choice is part of the navigation strategy, though in such a simple algorithm there
is not a lot of choice about what to do next if the navigation strategy does not halt the
search. The choice of the next vertex to move to is made by the navigation strategy,
as is the number of elements of N(v) to examine before making the move. Having
examined a number of next possibilities, the navigation strategy selects one, usually
on the basis of the fitnesses observed.
The particular solutions to the choices above are usually responsible for the name
given to the algorithm. For example, if all the neighbors are examined and one of
those with maximal fitness is chosen to move to, the hillclimbing algorithm is called
Steepest Ascent. If the navigation strategy examines neighbors until one with better
fitness is found and then moves to it, we shall call the hillclimbing algorithm Any
Ascent. Other aspects of algorithms that will be considered part of the navigation
strategy include such things as: deciding when to stop, deciding which operators to
use and when, deciding how and when to change the algorithm's temperature variable
(should one exist), deciding to adjust the representation or the mapping ⇝ between
the object space (O) and representation space (R) to focus on a specific area of O,
deciding on population size (if the algorithm maintains a population), and deciding
on a balance between exploration and exploitation.
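The difference between the two named navigation strategies is small enough to show in code. This sketch (mine, not the dissertation's; it uses single bit flips and the one max fitness purely as a toy) makes one move from a vertex under each strategy:

    import random

    def flip_neighbors(s):
        """All strings reachable from s by flipping exactly one bit."""
        return [s[:i] + ("1" if s[i] == "0" else "0") + s[i + 1:] for i in range(len(s))]

    def steepest_ascent_step(v, f):
        """Examine every neighbor and move to a fittest one, if it is an improvement."""
        best = max(flip_neighbors(v), key=f)
        return best if f(best) > f(v) else v

    def any_ascent_step(v, f):
        """Examine neighbors in random order and move to the first improvement found."""
        nbrs = flip_neighbors(v)
        random.shuffle(nbrs)
        for w in nbrs:
            if f(w) > f(v):
                return w
        return v

    ones = lambda s: s.count("1")
    print(steepest_ascent_step("0100", ones))   # a string with two ones
    print(any_ascent_step("0100", ones))        # also a string with two ones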
two parents, Figure 10 shows the landscape generated by uniform crossover
that produces two children from two parents, and Figure 11 is the landscape
generated by one-point crossover that produces one child from two parents. A
more general figure for parameterized uniform crossover [95] would have probabilities
on the edges, similar to those in Figure 3 on page 33.
¹ This is a practical yardstick, and is motivated by the belief that when choosing an algorithm
to solve a problem, one is chiefly motivated by the amount of time that will be spent awaiting a
solution. If the number of function evaluations performed by an algorithm is proportional to its run
time, then concentrating on the expected number of evaluations is appropriate.
Figure 9. The two-point crossover landscape for binary strings of length three. The
two-point crossover operator produces two offspring from two parents. Edges are
bidirectional.
Figure 10. The uniform crossover landscape for binary strings of length three. Bits
are chosen from parents with probability 1/2. The crossover operator produces two
offspring from two parents. Edges are bidirectional.
Figure 11. The one-point crossover landscape for binary strings of length two. The
crossover operator produces one offspring from two parents (and is consequently
not walkable). This is done by generating both offspring and then selecting one
uniformly at random to retain. Other offspring selection methods would alter edge
probabilities.
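A sketch of the operator behind Figure 11 (my own illustration): one-point crossover generates both offspring and keeps one of them uniformly at random, so its output is a single string and cannot be fed back in as a pair, which is why the resulting landscape is not walkable.

    import random

    def one_point_crossover_one_child(a, b):
        """Cut both parents at the same random point, swap tails, keep one child."""
        assert len(a) == len(b)
        cut = random.randrange(1, len(a))          # cut point between 1 and len-1
        child1 = a[:cut] + b[cut:]
        child2 = b[:cut] + a[cut:]
        return random.choice((child1, child2))

    print(one_point_crossover_one_child("00", "11"))   # one of '01', '10'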
hillclimbing algorithm (from the description given by Forrest et al. [96]) to increase its
performance. This is described in §3.6(55). The crossover hillclimber was surprising in
its robustness. The parameter settings used were chosen after a few very preliminary
experiments and not altered until the reasons for the success of the algorithm were
later sought.
locus in the population converges to at least this frequency, the run is terminated.
The per-locus mutation probability was 0.01. The probability that crossover was
applied to a pair before they were added to the next generation was 0.75. To avoid
complications with selection, binary tournament selection [97] was used as it is rank-based,
avoiding any need to scale population fitness or otherwise consider the fitness
function, even across experiments. The winner of the tournament was given a 0.75
chance of being selected. See [98, 99] for discussion on the advantages of rank-based
selection schemes.
A major omission from this list of parameters is some form of elitism [30], either
an explicit copying of the best individual or a generation gap less than one to decrease
the probability of discarding the best structure found so far. There has been recent
discussion on the use of GAs as function optimizers [100, 101], perhaps stemming from
Holland's reminder that the original intent of the GA was more one of improvement
than one of optimization [102]. It is not clear whether a standard GA should include
some form of elitism or not [5, 9, 30, 103]. Grefenstette's standard GA [103], which
takes its lead from De Jong [30], recommends some form of elitism. De Jong refers to
a "canonical" GA (which does not use elitism) and mentions elitism as a method of
improving GA performance when a GA is used as a function optimizer [100].
It is perhaps unfair to compare the GA against two algorithms that use a very
elitist hillclimbing strategy, but then again, they are different algorithms. The obvious
solution is to examine both possibilities. A summary of the result of this is that
elitism never resulted in worse performance (usually the GA with elitism would attain
difficult performance levels in five to ten times fewer function evaluations than the
GA without elitism), but that the change never resulted in a significant change in the
positional ranking of the GA with respect to the other two algorithms. For example,
if the performance on a problem of a GA without elitism was between the other
algorithms, then the GA with elitism would perform better than the GA without
elitism, but still between the other algorithms. In what follows, the standard and
elitist versions of the GA are denoted by GA-S and GA-E. Abbreviations for all the
algorithms of this chapter can be found in Table 1.
Table 1. Algorithm abbreviations and brief descriptions. The last column gives the
page on which a fuller description of each algorithm can be found.
royal road functions examined by Mitchell, Forrest and Holland [36, 96], the algorithm
has the important property of moving indefinitely on a plateau. The royal road
functions have many such sets, which are not mesas and so are eventually escaped by
this algorithm. This accounts for its success on these problems over other hillclimbers
that use the same operator but do not make moves to points of equal fitness.
In general, the operators and representation chosen to attack a search problem
may create many plateaus that are mesas. When BH encounters a mesa it will
wander on it indefinitely. For this reason, the algorithm has a parameter that limits
the number of steps that may be taken without a fitness increase. This parameter is
the only difference between BH and RMHC. If this number (which was set to 10,000)
is exceeded, the search is terminated. Without such a limit, BH encounters chronic
problems on the busy beaver problem (described below) as a result of the choice of
representation for that problem. The maximum number of random mutations to try
before deciding that a point has no equal or higher neighbor and terminating the
search was set to twice the number of neighbors.
² Naturally, there is a chance (typically high) that the global optimum (assuming there is only
one) cannot be discovered from a randomly chosen starting pair. With binary strings of length l,
this probability is 1 − (3/4)^l.
Figure 12. A fragment of a hillclimb under crossover. Three uphill steps are taken
(using crossover) within the right hypercube, from (a, b) to (a‴, b‴). At that point
b‴ is discarded and replaced with the randomly generated c. Two more steps are
then taken on the left hypercube. The jump into the left hypercube is either the
result of a limit on the number of steps in a hypercube (max-steps) or a limit on the
number of attempts to find an improving crossover (max-attempts).
the problem clearly has a single global maximum and no local optima under the
bit-flipping operator, there are local optima for other operators, as demonstrated by
Culberson [29], who shows that the problem contains local optima under one-point
crossover. The three instances of the problem that were studied had lengths 30, 60
and 120.
Table 2. A fully easy 6-bit problem. Unitation is the number of bits in the binary
string that are set to one. The maximum fitness point is the string 000000 with zero
ones.

Unitation  0    1    2    3    4    5    6
Fitness    1.0  0.8  0.6  0.9  0.5  0.7  0.9
Fully easy problems were designed with GAs in mind, and their name reflects
reasoning that they should be easy for GAs to solve. This does not imply that they
will be easy for other forms of search; by observation, the problem contains local
maxima under the bit-flipping operator. For instance, a steepest ascent hillclimber
using that operator would reach a local maximum if it started to climb from any
string with more than a single one bit. The global optimum, the string with six
zeroes, has a basin of attraction that includes only seven of the sixty-four possible
strings. Each six-bit problem has (6 choose 3) + (6 choose 6) = 20 + 1 = 21 local
maxima, which is almost a third of the entire space. A discussion of the problem's
difficulty for BH can be found below. Each problem instance was constructed by
concatenating some number of these 6-bit problems together. The three experiments
used 5, 10 and 15 such subproblems, resulting in problem sizes of 30, 60 and 90 bits.
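These counts are small enough to verify by brute force. A quick check of my own (not from the text), over all 64 strings, using the fitness table above and single bit-flip neighborhoods:

    from itertools import product

    FITNESS = {0: 1.0, 1: 0.8, 2: 0.6, 3: 0.9, 4: 0.5, 5: 0.7, 6: 0.9}   # Table 2

    def fit(s):
        return FITNESS[s.count("1")]

    def flip_neighbors(s):
        return [s[:i] + ("1" if s[i] == "0" else "0") + s[i + 1:] for i in range(len(s))]

    space = ["".join(bits) for bits in product("01", repeat=6)]
    peaks = [s for s in space if all(fit(w) < fit(s) for w in flip_neighbors(s))]
    local_maxima = [s for s in peaks if fit(s) < 1.0]          # exclude the global optimum
    print(len(local_maxima))                                   # 21

    # Basin of the global optimum under steepest ascent: strings with at most one 1.
    def steepest(s):
        best = max(flip_neighbors(s), key=fit)
        return best if fit(best) > fit(s) else s

    def climbs_to_global(s):
        while True:
            t = steepest(s)
            if t == s:
                return fit(s) == 1.0
            s = t

    print(sum(climbs_to_global(s) for s in space))             # 7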
Table 3. A fully deceptive 6-bit problem. Unitation is the number of bits in the
binary string that are set to one. The maximum fitness point is the string 111111
with six ones.

Unitation  0     1     2     3     4     5     6
Fitness    0.90  0.45  0.35  0.30  0.30  0.25  1.00
The first thing to notice about this problem is that the location of the global
maximum is the string of all ones, the opposite of the fully easy problem. This
problem has one global maximum (all ones) and one local maximum under the bit-flipping
operator (all zeroes). Under steepest ascent, the global maximum can be
reached from only seven points in the space. The local maximum will be reached
from the rest of the space.
before halting. The number of 1 symbols that remain on the tape is the machine's
score. The problem is to find, for a given k, a halting TM with the highest score,
which will be denoted by Σ(k). More than one machine may generate Σ(k) ones.
The busy beaver problem is extremely difficult and has an interesting history. The
problem is solved for k ≤ 4. There are two 5-state machines that halt and leave 4,098
ones on the tape and a 6-state machine that produces over 95 million! See Brady
[107, 108] for details of k = 4 and Marxen and Buntrock [109] for k = 5. The results
of applying a GA to the problem and the original presentation of reverse hillclimbing
(the subject of Chapter 4) can be found in [110].
3.8.5.1. Fitness and Representation
The fitness of a TM was its score, or −1 if the machine did not halt. The general
halting problem is not an issue for k ≤ 4 (as in our experiments), since the problem
is solved for machines with up to 4 states. In each case, the maximum number of
steps taken by a halting machine is known and can be used to terminate evaluation.
A TM with k states was represented by a character string of 6k bytes. When
the machine is in a certain state, looking at a certain symbol, it needs to know three
things: the next state, the symbol to write on the tape and the direction in which to
move. Since the machine could be scanning either a 0 or a 1, six bytes are required.
By restricting crossover to byte boundaries, only legal TMs were generated.
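A rough sketch of this kind of representation and fitness evaluation (my own reading of the description above, not the dissertation's code; for clarity the transition table is a dict rather than a 6k-byte character string): for each (state, scanned symbol) pair it stores the next state, the symbol to write and the direction to move, and evaluation runs the machine under a step limit, scoring −1 if it has not halted by then.

    def evaluate_tm(table, k, max_steps):
        """Run a k-state TM from an empty tape; return the number of 1s left on
        the tape if it halts within max_steps, and -1 otherwise.

        table[(state, symbol)] = (next_state, write_symbol, direction), where
        direction is -1 (left) or +1 (right) and next_state == k means halt.
        """
        tape, head, state = {}, 0, 0
        for _ in range(max_steps):
            if state == k:                                   # halt state reached
                return sum(tape.values())
            symbol = tape.get(head, 0)
            state, write, direction = table[(state, symbol)]
            tape[head] = write
            head += direction
        return -1

    # A 2-state busy beaver champion (score 4), written in the format above.
    bb2 = {
        (0, 0): (1, 1, +1), (0, 1): (1, 1, -1),
        (1, 0): (0, 1, -1), (1, 1): (2, 1, +1),   # next_state 2 == halt
    }
    print(evaluate_tm(bb2, k=2, max_steps=100))   # 4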
This representation allows the use of simple operators. There are a number of
symmetries that other approaches have taken advantage of that are not exploited
here. For example, the direction the tape head moves when the machine enters the
halt state makes no difference to the number of ones left on the tape. Noticing this,
and including it in the representation, halves the size of the search space. Similarly,
the symbol written to the tape on the transition into the halt state can be assumed to
be a 1 as this assumption cannot reduce the number of ones the machine has written.
Other symmetries include the renaming of states, and replacing left moves with right
and vice-versa.
The "representation" that has been most widely used in other work on this problem
is known as Tree Normal Form. This takes advantage of all the above symmetries.
Unfortunately it is not so much a representation as an algorithm for producing and
running TMs. If such a representation were adopted, and there is no doubt that it
could be, it would also require the construction of specialized operators designed to
produce TMs of the correct form. These issues are exactly those that were discussed
in §1.3(5). Once again, intelligent observations about the nature of the problem lead
to representations that greatly reduce the size of the problem. For example, for 4
3.8.5.2. Mesas
The combination of representation and operator here results in a landscape with many
mesas. The operator used by BH changes any state into any other, a left movement
into a right, or a one into a zero. Every halting TM in the landscape is connected
by an edge to a twin that produces the same number of ones. This corresponds to
changing the direction of movement when entering the halt state. Thus every TM
with no fitter neighbors is part of a mesa of size at least 2 which, unless modified,
BH can never escape. This was the motivation for imposing a limit on the number of
non-improving steps that BH allows.
It is also interesting to note that every halting TM is connected by an edge to at
least one TM that does not halt. A halting TM can be made to not halt by modifying
the behavior when in the initial state so that it remains in that state (either symbol may
be written and either direction may be taken). Thus every machine with maximal
fitness also has at least one neighbor that has lowest fitness. Similarly, each TM
with a single transition into the halt state can be made to not halt by altering this
transition. These phenomena are the result of a simplistic choice of representation
and operators.
and region. These are introduced below and can be relied on to always mean exactly
the same thing.
3.8.6.1. Description
The function takes a binary string as input and produces a real value which the
searcher must maximize. The string is composed of 2^k non-overlapping contiguous
regions, each of length b + g. With Holland's defaults, k = 4, b = 8, g = 7, there are
16 regions of length 15, giving an overall string length of 240. We will number the
regions, from the left, as 0, 1, …, 2^k − 1.
Each region is divided into two non-overlapping pieces. The first, of length b,
will be called the block. The second, of length g, will be called the gap.³ In the
fitness calculation, only the bits in the block part of each region are considered. The
bits in the gap part of each region are completely ignored during fitness calculation.
Holland's description called the block part of each region by various names: "building
blocks," "elementary building blocks," "schemata," "lower level target schemata,"
and "elementary (lowest-level) building blocks (schemata)."
The fitness calculation proceeds in two steps. Firstly, there is what Holland
calls the part calculation. Then follows the bonus calculation. The overall fitness
assigned to the string is the sum of these two calculations.
³ Holland did not give a name for this part of the regions, or give a variable name for its length.
These gaps have been called "introns" (borrowing from biology) and have been used with varying
degrees of success [96, 113] to alter the effects of crossover in genetic algorithms.
⁴ Holland used m(i) to denote the number of 1's in block i. I will not use a variable.
Finally, if a block consists entirely of 1's (i.e., it has b 1's), it receives nothing
from the part calculation. Such a block will be rewarded in the bonus calculation.
If a block consists entirely of 1 bits, it will be said to be complete. From the above,
we can construct a table of fitness values based on the number of ones in a block.
Table 4 gives the values for the default settings.

Unitation      0     1     2     3     4     5      6      7      8
Block fitness  0.00  0.02  0.04  0.06  0.08  -0.02  -0.04  -0.06  0.00
block is labeled B_i^{2^l}, with 0 ≤ i < 2^{k−l}. At all levels, the first such set of complete
blocks receives fitness u*, and additional sets of completed blocks receive fitness u.
The total fitness for the level is the sum of these fitnesses.
To make this more concrete, consider the function with Holland's default values.
With k = 4 we have 16 regions and each contains a block of length b = 8.
The bonus fitness calculation rewards completed single blocks (this is level 0),
and rewards the completion of the sets {B0, B1}, {B2, B3}, {B4, B5}, {B6, B7},
{B8, B9}, {B10, B11}, {B12, B13}, {B14, B15} (level 1); {B0, …, B3}, {B4, …, B7},
{B8, …, B11}, {B12, …, B15} (level 2); {B0, …, B7}, {B8, …, B15} (level 3); and finally
{B0, …, B15} (level 4) of completed blocks. The total bonus contribution to
the fitness is computed by adding the fitness at each of the k + 1 levels.
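The two-step calculation can be summarized in code. The sketch below is my own paraphrase of the description above, with the part values inferred from the block-fitness table (a per-one reward of v = 0.02 up to four ones); the bonus parameters ustar and u are left as arguments because their default values are not given in this excerpt.

    def holland_royal_road(s, k=4, b=8, g=7, v=0.02, mstar=4, ustar=1.0, u=0.3):
        """Sketch of Holland's royal road fitness: part calculation + bonus calculation.

        The ustar/u defaults here are placeholders, not values taken from the text.
        """
        blocks = [s[i * (b + g): i * (b + g) + b] for i in range(2 ** k)]   # gaps ignored

        # Part calculation: reward up to mstar ones, penalize mstar < ones < b,
        # and give nothing to a complete block (it is rewarded by the bonus).
        part = 0.0
        for block in blocks:
            ones = block.count("1")
            if ones == b:
                continue
            part += ones * v if ones <= mstar else -(ones - mstar) * v

        # Bonus calculation: at level l, group 2**l adjacent blocks; the first
        # completed group at a level earns ustar, each additional one earns u.
        complete = [block.count("1") == len(block) for block in blocks]
        bonus = 0.0
        for level in range(k + 1):
            size = 2 ** level
            groups = [complete[i:i + size] for i in range(0, len(complete), size)]
            done = sum(all(group) for group in groups)
            if done:
                bonus += ustar + (done - 1) * u
        return part + bonus

    print(holland_royal_road("1" * 240))   # the all-ones string scores the maximum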
3.8.6.2. Experiments
The experiments on Holland's royal road in this chapter only varied k, which determines
the number of level zero blocks and hence the number of levels. The three
values used were 2, 4 and 6, giving problems with 60, 240 and 960 bits.
3.9. Results
The results of twelve of the eighteen experiments are shown in Figures 13 to 24.
Results for the smallest instance of each problem are not shown. Performance on
the smallest instances of each problem qualitatively matched performance on the
larger instances. In all these graphs, the X axis (mean number of evaluations) has
a log scale. Each of the algorithms was run at least 2000 times on each instance of
each problem and in most cases at least 10,000. Each time an algorithm achieves a
performance level (for instance, completing a fully deceptive subproblem), the number
of evaluations is recorded. From this, the mean number of evaluations taken to
achieve performance levels is plotted for each algorithm on each problem. The line
representing an algorithm is terminated at the highest level that was achieved at least
ten times. Standard errors for the following graphs are presented in Tables 40 to 69
in Appendix C.
worst algorithm on four of the functions (fully deceptive, distributed fully deceptive,
fully easy and Holland's royal road), and the two GA variants are clearly the worst
algorithms on the remaining two (busy beavers and one max). On three of the
functions (fully deceptive, fully easy and Holland's royal road), the CH algorithm is
the best performer and it can easily be argued that it is the best on a fourth (busy
beavers). Only on one problem of the six, one max, is it significantly beaten (by BH,
which is hardly surprising).
It is interesting to note how much consistency there is in the graphs for each
experiment. Each of the six experiments has three graphs and within each set of three
there is remarkable consistency. The shapes of the lines representing each algorithm
are well preserved over the graphs, as are the relative orderings of the algorithms and
the ways that the orderings change, and when they change, as higher performance
levels are achieved.
The addition of elitism to the GA resulted in improvement in every case (though
some were very minimal). However, in no case was the improvement great enough to
significantly alter the relative ranking of the GA without elitism. For example, if the
order of the three algorithms CH, GA-S and BH was, say, BH, GA-S, CH from best
to worst, then the performance of GA-E would fall between that of BH and GA-S.
Table 5 gives a rough overview of the performance of the algorithms. Each
                        BH   CH   GA
Busy beavers             1    2    3
One max                  1    2    3
Fully deceptive          3    1    2
Distributed deceptive    3    1    1
Fully easy               3    1    2
Holland's royal road     3    1    2
algorithm is ranked for each problem, with a rank of 1 being the best. When it was
not clear which algorithm was better from looking at the three experiments for a
problem, preference was given to the algorithm that performed best on the largest
instance of the problem and algorithms received equal rank if there was still no clear
difference. The two versions of the GA have been coalesced since they virtually always
rank next to each other. This allows a clearer view of the overall pattern. It should
be clear that CH is never the worst, the GA is never clearly the best and BH is either
best or worst.
Figure 13. Mean evaluations to solve a 60-bit one max problem using crossover hillclimbing (CH), a standard GA (GA-S), an elitist GA (GA-E) and random mutation hillclimbing (BH). (Y axis: Bits Set to One; X axis: Mean Evaluations, log scale.)

Figure 14. Mean evaluations to solve a 120-bit one max problem using crossover hillclimbing (CH), a standard GA (GA-S), an elitist GA (GA-E) and random mutation hillclimbing (BH). (Y axis: Bits Set to One; X axis: Mean Evaluations, log scale.)
described in §3.8.1(58). These are consistent with what we might expect from pre-
vious work by Ackley [26] and Culberson [29]. Ackley's results show two versions of
a hillclimber clearly outperforming two versions of a GA. Culberson demonstrated
the existence of 2→2-local-maxima in the component of the one-point crossover land-
scape that corresponds to complementary binary strings for the one max function.
There are 2→2-mesas in components of the two-point crossover landscape for non-
complementary strings. A trivial example is any pair (a, a) where a is not the global
maximum. In general, if f(a, b) = max{g(a), g(b)} and g(v) is the number of ones
Figure 15. Mean evaluations to solve 10 fully easy 6-bit problems using crossover hillclimbing (CH), a standard GA (GA-S), an elitist GA (GA-E) and random mutation hillclimbing (BH). (Y axis: Blocks Completed; X axis: Mean Evaluations, log scale.)

Figure 16. Mean evaluations to solve 15 fully easy 6-bit problems using crossover hillclimbing (CH), a standard GA (GA-S), an elitist GA (GA-E) and random mutation hillclimbing (BH). (Y axis: Blocks Completed; X axis: Mean Evaluations, log scale.)
Figure 17. Mean evaluations to solve 10 fully deceptive 6-bit problems with crossover hillclimbing (CH), a standard GA (GA-S), an elitist GA (GA-E) and random mutation hillclimbing (BH). (Y axis: Blocks Completed; X axis: Mean Evaluations, log scale.)

Figure 18. Mean evaluations to solve 15 fully deceptive 6-bit problems with crossover hillclimbing (CH), a standard GA (GA-S), an elitist GA (GA-E) and random mutation hillclimbing (BH). (Y axis: Blocks Completed; X axis: Mean Evaluations, log scale.)
evaluations than GA-S to achieve the highest performance levels. The performance
of GA-E was quite close to that of the CH algorithm on the largest of the problems.
No algorithm managed to optimize all 15 deceptive blocks in the hardest experiment.
BH performed very poorly.
Figure 19. Mean evaluations to solve 10 distributed fully deceptive 6-bit problems using crossover hillclimbing (CH), a standard GA (GA-S), an elitist GA (GA-E) and random mutation hillclimbing (BH). (Y axis: Blocks Completed; X axis: Mean Evaluations, log scale.)

Figure 20. Mean evaluations to solve 15 distributed fully deceptive 6-bit problems using crossover hillclimbing (CH), a standard GA (GA-S), an elitist GA (GA-E) and random mutation hillclimbing (BH). (Y axis: Blocks Completed; X axis: Mean Evaluations, log scale.)
Figure 21. Mean evaluations to find 3-state Turing machines for the busy beaver problem using crossover hillclimbing (CH), a standard GA (GA-S), an elitist GA (GA-E) and random mutation hillclimbing (BH). (Y axis: Ones Left on Tape; X axis: Mean Evaluations, log scale.)

Figure 22. Mean evaluations to find 4-state Turing machines for the busy beaver problem using crossover hillclimbing (CH), a standard GA (GA-S), an elitist GA (GA-E) and random mutation hillclimbing (BH). (Y axis: Ones Left on Tape; X axis: Mean Evaluations, log scale.)
virtually indistinguishable, with the GAs coming out slightly ahead. In all cases,
the BH algorithm does very poorly, which is hardly surprising as one of Holland's
design goals was to create a function that would be very difficult to optimize via
mutation-based hillclimbing.
The closeness in the results for the GAs and the CH algorithm may be due to
the comparatively large size of the problem (960 bits). It is worth noting that the
algorithms actually climb fewer levels (only 3) in this problem than in the simpler
k = 4 problem (where they climb 4). The harder problem has 7 levels that could
have been climbed whereas the easier problem only has 5. Increasing the parameters
on the GAs and CH to allow them to search longer may have revealed a difference.
Figure 23. Mean evaluations to climb levels in Holland's royal road with k = 4 using crossover hillclimbing (CH), a standard GA (GA-S), an elitist GA (GA-E) and random mutation hillclimbing (BH). (Y axis: Levels Achieved; X axis: Mean Evaluations, log scale.)

Figure 24. Mean evaluations to climb levels in Holland's royal road with k = 6 using crossover hillclimbing (CH), a standard GA (GA-S), an elitist GA (GA-E) and random mutation hillclimbing (BH). (Y axis: Levels Achieved; X axis: Mean Evaluations, log scale.)
operator, at times more powerful than mutation. The algorithm was designed to
examine crossover in a simple context. To a certain extent this has been achieved,
but one loose end remains to be considered. CH is creating new individuals in two
ways: through the use of crossover and at random. Initially, two random individuals
are created and thereafter a random individual is created each time the algorithm
jumps to a new hypercube. How important are these random individuals, and what
role do they play in aiding the search? It is simple to examine the number of times
a random individual is fitter than any individual seen so far. When this is done, the
results are unsurprising. Random creation produces very few highly fit individuals.
For example, Table 6 shows the number of times a randomly created individual was
responsible for an increase in best fitness over 1000 runs of CH on the 4-state busy
beaver problem. In this example, a Turing machine that left seven ones on the tape
before halting was discovered on 990 of the 1000 runs, but on only 1 of these was the
first such discovery a result of random creation.
To get an idea of the importance of the random individuals as a source of material
for crossover to exploit, the CH algorithm used above, CH(10,3,1000), was compared
to two variant crossover hillclimbing algorithms, CH(10,3000,0) and CH(10,1,3000).
The two new algorithms represent extremes of the spectrum with respect to the
random creation of new individuals. CH(10,3000,0) never jumps to a new hypercube
Table 6. The frequency of fitness increases due to random discovery in 1000 runs of
the CH algorithm on the 4-state busy beaver problem. The new individual must be
the best found so far to gain mention here. The number of times each fitness level
was discovered is shown for comparison.
when no acceptable crossover can be found from the current pair. Each search will
therefore take place entirely within the hypercube determined by the initial choice
of random individuals. When an improving (or equaling) crossover cannot be found
from the current pair, the search is concluded. This algorithm will be denoted by
CH-NJ (NJ = No Jumps). At the other extreme, CH(10,1,3000) allows at most 1
step to be taken by a pair within a hypercube before a jump to another hypercube is
forced. This algorithm will be denoted by CH-1S (1S = 1 Step). These parameter
settings limit all three of these algorithms to 3000 steps via crossover.
Figures 25 to 30 show the results on the largest instance of each of the problems
(with the exception of Holland's royal road, which shows the default problem). It
is clear that CH-1S is the best of the three algorithms on these instances of these
problems. Experiments with other settings of the three parameters of the crossover
hillclimbing algorithm (not shown) have provided further evidence that the algorithm
performs better the more often it jumps between hypercubes. CH-1S is extreme in
this respect, as it jumps after every single crossover. This is remarkable as the first
crossover performed after a jump always involves a random individual (the jump is
Figure 25. Mean evaluations to solve a 120-bit one max problem using crossover hillclimbing (CH), crossover hillclimbing with no jumps between hypercubes (CH-NJ) and crossover hillclimbing with at most one step per hypercube (CH-1S). (Y axis: Bits Set to One; X axis: Mean Evaluations, log scale.)

Figure 26. Mean evaluations to complete 15 6-bit fully easy subproblems using crossover hillclimbing (CH), crossover hillclimbing with no jumps between hypercubes (CH-NJ) and crossover hillclimbing with at most one step per hypercube (CH-1S). (Y axis: Blocks Completed; X axis: Mean Evaluations, log scale.)
Figure 27. Mean evaluations to complete 15 6-bit fully deceptive subproblems using crossover hillclimbing (CH), crossover hillclimbing with no jumps between hypercubes (CH-NJ) and crossover hillclimbing with at most one step per hypercube (CH-1S). (Y axis: Blocks Completed; X axis: Mean Evaluations, log scale.)

Figure 28. Mean evaluations to complete 15 6-bit distributed fully deceptive subproblems using crossover hillclimbing (CH), crossover hillclimbing with no jumps between hypercubes (CH-NJ) and crossover hillclimbing with at most one step per hypercube (CH-1S). (Y axis: Blocks Completed; X axis: Mean Evaluations, log scale.)
Figure 29. Mean evaluations to find 4-state busy beaver TMs using crossover hillclimbing (CH), crossover hillclimbing with no jumps between hypercubes (CH-NJ) and crossover hillclimbing with at most one step per hypercube (CH-1S). (Y axis: Ones Left on Tape; X axis: Mean Evaluations, log scale.)

Figure 30. Mean evaluations to achieve levels on Holland's royal road with k = 4 using crossover hillclimbing (CH), crossover hillclimbing with no jumps between hypercubes (CH-NJ) and crossover hillclimbing with at most one step per hypercube (CH-1S). (Y axis: Levels Achieved; X axis: Mean Evaluations, log scale.)
the idea, for instance via one-point, two-point, uniform or one of the many forms
of crossover used for representations other than fixed-length strings. All forms of
crossover share a similar idea, but the mechanics vary considerably.
The experiments with various forms of crossover hillclimbing show that crossover
may be very useful even in the absence of the idea of crossover. That is, even when
there is no reason to believe that both parents have an above average chance of con-
tributing above average material to the offspring, crossover may still be useful simply
through performing macromutation via its mechanics. When considering whether
crossover is useful to a GA, we should attempt to distinguish the gains the algorithm
is making using the idea of crossover from those made simply through the mechanics.
If there is no additional gain due to the idea of crossover, it may be the case that we
would do as well if we discarded the population (and thus the GA) and instead used
an algorithm that employed macromutation.
Figure 31. The random crossover operator used in the headless chicken test. When
the GA with random crossover passes two parents to the operator, instead of recom-
bining them as a normal crossover would do, two random individuals are created and
these are crossed with the parents. One offspring is kept from each of these crosses
and these are handed back to the GA. The crossover method used (one-point, two-
point, uniform, etc.) is the same as that used in the version of the GA that is being
compared to the GA with random crossover.
crossover points and set all the loci between the points to randomly chosen alleles.
This is an identical operation and is clearly just a macromutation.
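A minimal sketch of the operator of Figure 31, in Python, is given below. The helper names and the use of one-point crossover are illustrative assumptions; the only property relied upon is that each parent is recombined with a freshly generated random individual rather than with the other parent.

import random

def one_point_crossover(a, b):
    """Ordinary one-point crossover on two equal-length bit strings (lists of 0/1)."""
    point = random.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]

def random_crossover(parent1, parent2):
    """Headless chicken operator: cross each parent with a new random individual
    and keep one offspring from each cross (see Figure 31)."""
    rand1 = [random.randint(0, 1) for _ in parent1]
    rand2 = [random.randint(0, 1) for _ in parent2]
    child1, _ = one_point_crossover(parent1, rand1)
    child2, _ = one_point_crossover(parent2, rand2)
    return child1, child2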
This comparison was performed (1) between GA-S and an identical GA with
random crossover (GA-RC) and (2) between GA-E and an elitist GA with random
crossover (GA-RCE). Figures 32 to 37 show the result of this experiment on the
problem instances used to compare the CH variants. The results of the comparison
between GA-E and GA-RCE are not shown (these exhibit identical qualitative results
to those of GA-S versus GA-RC). On all instances of all problems, GA-S is a clear
winner over GA-RC when the problem contains well-defined building blocks. This
is the case with the fully deceptive, fully easy, Holland's royal road and one max
problems. On these problems, the idea of crossover is aiding the search, as we would
hope.
More interesting is the comparison on the problems where well-defined building
blocks do not exist: the busy beaver and distributed fully deceptive problems. On the
busy beaver problems, the performance of the GA and GA-RC is hard to distinguish.
On the distributed fully deceptive problem, GA-RC actually outperforms the GA.
Figure 32. Mean evaluations to solve a 120-bit one max problem using a standard GA (GA-S) and a GA with random crossover (GA-RC). (Y axis: Bits Set to One; X axis: Mean Evaluations, log scale.)

Figure 33. Mean evaluations to complete 15 6-bit fully easy subproblems using a standard GA (GA-S) and a GA with random crossover (GA-RC). (Y axis: Blocks Completed; X axis: Mean Evaluations, log scale.)
Figure 34. Mean evaluations to complete 15 6-bit fully deceptive subproblems using a standard GA (GA-S) and a GA with random crossover (GA-RC). (Y axis: Blocks Completed; X axis: Mean Evaluations, log scale.)

Figure 35. Mean evaluations to complete 15 6-bit distributed fully deceptive subproblems using a standard GA (GA-S) and a GA with random crossover (GA-RC). (Y axis: Blocks Completed; X axis: Mean Evaluations, log scale.)
Figure 36. Mean evaluations to find 4-state busy beaver TMs using a standard GA (GA-S) and a GA with random crossover (GA-RC). (Y axis: Ones Left on Tape; X axis: Mean Evaluations, log scale.)

Figure 37. Mean evaluations to achieve levels on Holland's royal road with k = 4 using a standard GA (GA-S) and a GA with random crossover (GA-RC). (Y axis: Levels Achieved; X axis: Mean Evaluations, log scale.)
In this case, the GA is not only using an operator (two-point crossover) that, in an
informal sense, has a bias that is orthogonal to the structure of the encoding, but it
is also drawing its samples from a population. GA-RC is using the same operator
but has the advantage of being able to explore more widely. When GA-RC is as good
or better than the normal GA on a problem, it suggests that the idea of crossover is
doing nothing significant for the normal GA. In such cases it seems safe to conclude
that the chosen combination of representation, fitness function and crossover operator
does not result in a problem that contains building blocks that the GA is managing to
exploit using the idea of crossover. De Jong, Spears and Gordon have also identified
problems on which the use of crossover results in worse performance [115].
This does not imply that there is no other representation that would contain
exploitable building blocks, or that another crossover operator might not do better at
exploiting the building blocks that do exist (if any), or that it is not possible to modify
the GA in some way so that it does exploit the building blocks (e.g., by using a larger
population, a different fitness function, or adopting measures to maintain diversity).
This is a conclusion about a single crossover operator, a single representation and
a single algorithm, and may well not apply if any of these are altered. The test
examines a particular set of choices and nothing more. In particular, the conclusion
that a GA is not suitable for a problem, based on a failed headless chicken test, is in
no way justified. If a GA fails the headless chicken test (i.e., it does not outperform a
GA with random crossover), the GA is not making gains from crossover above those
that could be made via explicit macromutation. If this is the case, it is not clear
why one would bother to maintain a population or, consequently, use a GA on this
combination of problem, representation and fitness function.
The experiment suggests an explanation for GAs that appear to make good use of
crossover early in a run, but not thereafter. It may be the case that the combination
of the representation and fitness function does not have building blocks that can be
exploited by the crossover operator in question. In this case, crossover may make
improvements simply through macromutations at the start of the run and then, as is
widely known, become increasingly less useful as the population converges. Crossover
appears to be useful (since disabling it results in worse performance), but in fact, its
usefulness is merely a consequence of the macromutations it is performing.
algorithm. Once again, the results depended on whether well-defined building blocks
existed. When they did, BH-MM outperformed BH-DMM by orders of magnitude.
When they did not (on the distributed fully deceptive problem), BH-DMM performed
an order of magnitude better than BH-MM. On every problem, BH-MM achieved the
highest levels of performance at least one order of magnitude faster than GA-E, and
in one case (15 fully easy subproblems) it required over 3 orders of magnitude fewer
evaluations to achieve the highest level reached by GA-E. Typically, the hillclimbers
would also reach levels of performance that were never achieved by the GA. The
results for the six problem instances are shown in Figures 38 to 43.
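Since the mechanics of BH-MM are not restated here, the following Python sketch is only a rough reconstruction: it assumes the macromutation is the two-point "choose two points and randomize every locus between them" operation described earlier in this chapter, and that the acceptance rule is the usual hillclimbing one of keeping any candidate that is at least as fit. The function names and parameters are illustrative.

import random

def macromutate(bits):
    """Two-point macromutation: pick two points and set every locus between
    them to a randomly chosen allele (the mechanics of crossover without a
    second parent)."""
    i, j = sorted(random.sample(range(len(bits) + 1), 2))
    return bits[:i] + [random.randint(0, 1) for _ in range(i, j)] + bits[j:]

def macromutation_hillclimb(fitness, length, evaluations):
    """Hillclimb by repeated macromutation, accepting any candidate that is
    at least as fit as the current string.  Returns the best string found."""
    current = [random.randint(0, 1) for _ in range(length)]
    best_f = fitness(current)
    for _ in range(evaluations):
        candidate = macromutate(current)
        f = fitness(candidate)
        if f >= best_f:
            current, best_f = candidate, f
    return current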
Figure 38. Mean evaluations to solve a 120-bit one max problem using crossover hillclimbing with at most one step per hypercube (CH-1S), an elitist GA (GA-E), hillclimbing with macromutations (BH-MM) and hillclimbing with distributed macromutations (BH-DMM). (Y axis: Bits Set to One; X axis: Mean Evaluations, log scale.)

Figure 39. Mean evaluations to complete 15 6-bit fully easy subproblems using crossover hillclimbing with at most one step per hypercube (CH-1S), an elitist GA (GA-E), hillclimbing with macromutations (BH-MM) and hillclimbing with distributed macromutations (BH-DMM). (Y axis: Blocks Completed; X axis: Mean Evaluations, log scale.)
3.14. Summary
The traditional method used to assess whether crossover is useful to a GA is to
compare the GA with crossover to a GA without crossover. This chapter argues that
if the GA with crossover performs better, the conclusion that crossover is therefore
useful (i.e., that it is facilitating the exchange of building blocks between individuals)
is not justified. A more informative method of assessing the worth of crossover to
Figure 40. Mean evaluations to complete 15 6-bit fully deceptive subproblems using crossover hillclimbing with at most one step per hypercube (CH-1S), an elitist GA (GA-E), hillclimbing with macromutations (BH-MM) and hillclimbing with distributed macromutations (BH-DMM). (Y axis: Blocks Completed; X axis: Mean Evaluations, log scale.)

Figure 41. Mean evaluations to complete 15 6-bit distributed fully deceptive subproblems using crossover hillclimbing with at most one step per hypercube (CH-1S), an elitist GA (GA-E), hillclimbing with macromutations (BH-MM) and hillclimbing with distributed macromutations (BH-DMM). (Y axis: Blocks Completed; X axis: Mean Evaluations, log scale.)
Figure 42. Mean evaluations to find 4-state busy beaver TMs using crossover hillclimbing with at most one step per hypercube (CH-1S), an elitist GA (GA-E), hillclimbing with macromutations (BH-MM) and hillclimbing with distributed macromutations (BH-DMM). (Y axis: Ones Left on Tape; X axis: Mean Evaluations, log scale.)

Figure 43. Mean evaluations to achieve levels on Holland's royal road with k = 4 using crossover hillclimbing with at most one step per hypercube (CH-1S), an elitist GA (GA-E), hillclimbing with macromutations (BH-MM) and hillclimbing with distributed macromutations (BH-DMM). (Y axis: Levels Achieved; X axis: Mean Evaluations, log scale.)
a GA was presented. This compares the normal GA with a GA that uses random
crossover. Such a GA dispenses with the idea of crossover while maintaining the
mechanics. This makes it possible to see what advantage the standard GA is gaining
from the idea of crossover over and above what it could be gaining from the mechanics
alone. In some cases, when well-defined building blocks are not present, the GA may
actually perform worse with normal crossover than a GA with random crossover as
a result of its use of a population. This test gives an indication of when building
blocks (that are exploitable by a GA with a given crossover operator) exist in a
representation.
The major reason for the maintenance of a population in a GA is to allow the com-
munication of information between individuals via crossover. When testing whether
it is worth bringing a population and crossover to bear on a problem, the null hypoth-
esis should be that the GA with crossover does not outperform the GA with random
crossover, not that the GA with crossover outperforms the GA without crossover. If
this comparison indicates that the idea of crossover is not producing significant gains,
the use of a GA is probably not warranted. In these cases, there are simpler algo-
rithms that use the macromutational mechanics of crossover and do not maintain a
population, and these can easily outperform the GA. Several such algorithms, based
on crossover hillclimbing or macromutational hillclimbing were presented. This point
was also made by Fogel and Atmar "88]. On the other hand, when building blocks
do exist, these algorithms (especially those that use macromutation exclusively) still
outperformed the GA on virtually every instance of every problem addressed. These
results support Eshelman and Scha
er's belief that the \niche" in which crossover
gives a GA a competitive advantage may be quite small "89]. Actually, the situation
may be worse than they feared, as a macromutational hillclimber easily outperforms
a GA on Holland's royal road, which has the properties that Eshelman and Scha
er
ascribe to problems residing in crossover's niche. The niche becomes smaller if we
insist that the idea of crossover, in addition to the mechanics, be responsible for good
performance. In the context of function optimization, it is clear that care must be
taken if one intends to make effective use of a population, crossover, and a GA.
CHAPTER 4
Reverse Hillclimbing
4.1. Introduction
There are many algorithms that can be informally described as hillclimbers. Though
simplistic, many hillclimbers prove surprisingly efficient in some settings. Slight en-
hancements can result in algorithms which are amongst the most useful (e.g., simu-
lated annealing). Despite the simplicity of hillclimbing algorithms, it is difficult to
accurately compare these algorithms to each other. To compare the performance of
two hillclimbers, the simplest method is to run them both a large number of times.
The results give a statistical indication of which algorithm is superior. This chap-
ter introduces a new technique, reverse hillclimbing, which, in some cases, allows a
much more direct comparison. This technique is used to compare four hillclimbing
algorithms on three instances of two problems.
The method also makes it possible to compute statistics regarding basins of
attraction in a landscape graph. Given a peak on a landscape, reverse hillclimbing
computes the exact size of the basin of the point under a hillclimbing algorithm and
the exact probability that the peak will be found via a single hillclimb. An attempt to
obtain similar statistics using hillclimbing to form estimates would probably consume
inordinate amounts of time. For example, a typical global optimum in the 4-state
busy beaver problem has a basin of attraction under steepest ascent hillclimbing of
size approximately 5000. The probability of locating such a peak from a randomly
started hillclimb is usually about 1 in 15 million. Thus one would expect to perform
15 million hillclimbs before stumbling on the peak even once. This would net a few
of the 5000 vertices in the basin. Obtaining a suitably reliable estimate (let alone the
exact number) of the basin's size would require a far greater number of hillclimbs.
Reverse hillclimbing can provide exact answers to these questions, and others, in
under a minute with very limited computational requirements. To my knowledge,
choose a vertex v ∈ V
while (not done) {
    use φ to generate w ∈ N(v)
    v ← w
}
The initial vertex is commonly chosen by uniform random selection from the
representation space. There are many other possibilities. For example, the search
may commence from a vertex generated by another search method, or the initial
vertex may be chosen according to some belief about the nature of the landscape, or
may be a starting configuration that is given by the statement of the problem. The
second step, generating w ∈ N(v), also has many variants. In the algorithms we
will consider, φ is employed to generate N ⊆ N(v), one element of which, w,
will be selected. As examples, some operators will use φ to generate N = N(v), the
entire φ-neighborhood of v, and then select the fittest. Others will use φ to generate a
fixed number of φ-neighbors of v, or generate φ-neighbors until a vertex fitter than v
is located or some limit is reached. In the final step, the algorithm either terminates
or loops for another iteration. The decision of when to terminate a search may be
according to some maximum number of evaluations, the achievement of a desired
fitness level, the detection of a local maximum, the result of running out of time for
the search, or some other method. The above description also deliberately makes no
mention of the return value of the algorithm. Typically these algorithms keep track
of the best element of R encountered during the search and return that. Algorithms
that never select a w such that f(w) <_F f(v) can simply return v when they are
done.
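The following Python sketch makes this skeleton concrete. The neighbor-selection step is factored out as a function argument so that variants such as steepest ascent and any ascent differ only in that argument; the names, the bit-flip neighborhood and the termination rule (stop at a local maximum or after a fixed number of uphill moves) are illustrative choices rather than the exact algorithms compared later.

import random

def bit_flip_neighbors(v):
    """One-mutant neighborhood of a bit string represented as a tuple of 0s and 1s."""
    return [v[:i] + (1 - v[i],) + v[i + 1:] for i in range(len(v))]

def steepest_ascent(v, fitness, neighbors):
    """Examine every neighbor and return the fittest one."""
    return max(neighbors(v), key=fitness)

def any_ascent(v, fitness, neighbors):
    """Return the first neighbor, examined in random order, that is fitter than v
    (or v itself if no neighbor is fitter)."""
    f_v = fitness(v)
    candidates = list(neighbors(v))
    random.shuffle(candidates)
    for w in candidates:
        if fitness(w) > f_v:
            return w
    return v

def hillclimb(start, fitness, neighbors, select_neighbor, max_steps=1000):
    """Generic hillclimbing skeleton: repeatedly move to a selected neighbor,
    stopping at a local maximum or after max_steps uphill moves."""
    v = start
    for _ in range(max_steps):
        w = select_neighbor(v, fitness, neighbors)
        if fitness(w) <= fitness(v):
            break                 # no strictly uphill move was produced
        v = w
    return v

For example, hillclimb(start, sum, bit_flip_neighbors, steepest_ascent) climbs a one max landscape from a random bit string start given as a tuple of 0s and 1s.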
Figure 44. A division of local search algorithms according to the types of fitness
moves they make. Hillclimbing algorithms comprise the inner two circles. The RMHC
and BH algorithms are described in Chapter 3. (Labels appearing in the figure:
LSA <=>, LSA >=, LSA >, RMHC, BH, Least Ascent, Median Ascent, Any Ascent,
Steepest Ascent, Next Ascent, Simulated Annealing.)
Figure 45. The landscape produced by the any ascent operator given a representation space {0,1}^3 and a hypothetical fitness function. Vertices are labeled with fitnesses and edges with transition probabilities.

Figure 46. The landscape produced by the least ascent operator given a representation space {0,1}^3 and a hypothetical fitness function. Vertices are labeled with fitnesses and edges with transition probabilities.

Figure 47. The landscape produced by the median ascent operator given a representation space {0,1}^3 and a hypothetical fitness function. Vertices are labeled with fitnesses and edges with transition probabilities.

Figure 48. The landscape produced by the steepest ascent operator given a representation space {0,1}^3 and a hypothetical fitness function. Vertices are labeled with fitnesses and edges with transition probabilities.
basin_size(VERTEX v)
{
    S = S ∪ {v}
    for each (w | v ∈ N(w) and f(w) <_F f(v))
        basin_size(w)
}
The function basin_size may be called more than once on the same vertex. To
see why this is true, consider Figure 49, which shows two paths to the same
vertex in a one max problem.
Figure 49. Two downhill trajectories when performing reverse hillclimbing on a one
max problem with strings of length three. The arrows illustrate how a vertex (in this
case 010) can be encountered more than once. In the basic algorithm, this vertex
would be the subject of two recursive calls to the basin_size function.
have been encountered for the final time, at which point they can be output or their
properties can be added to accumulating statistics about the basin. A straightforward
solution to this problem is to use a hash table to implement the set S above to
keep track of the details of vertices that have been encountered. This allows the
maintenance of details of probability of ascent, number of ways to ascend and expected
comparisons to ascend for each vertex in the basin. When the function basin_size
discovers a vertex that could ascend to the current argument, it checks the hash table
to see if this vertex has already been encountered. If so, it updates the data structure
containing the information for the vertex. If not, it creates a new data structure and
inserts it into the hash table. A rough outline of how each statistic is kept follows:
If a vertex v1 can ascend along some path to the original vertex v with probabil-
ity p, and it is discovered that v2, a less fit neighbor of v1, can ascend to v1 with
probability q, then pq is added to the overall probability that v2 can ascend to
v. Note that the algorithm may encounter v1 again (by finding another downhill
path that re-ascends to it), and this will cause the probabilities associated with
both v1 and v2 to be incremented again. When the recursion is finished, every
downhill path will have been taken, and the probabilities associated with the
vertices in the basins will be correct.
It is easy to keep track of the number of paths by which a vertex may ascend
to the original vertex. Simply keep a count of the number of times each vertex
in the basin is encountered during the recursion (including the first, in which
the vertex is not found in the hash table).
If a vertex can only ascend to a single peak under a given hillclimbing algorithm,
the vertex will be said to be owned by the peak that is reached. To calculate
how many of the vertices in the basin of attraction are owned by the peak, look
at the final ascent probabilities. Those vertices with probability one are owned.
Calculating the expected number of vertex fitness evaluations it will take a
vertex to reach the peak is similar to the calculation of ascent probability.
Information is passed down through the basin as paths are traversed. Three of
the four hillclimbers, LA, MA and SA, use φ to produce all the φ-neighbors, so
their expected evaluations per uphill step is simply the number of φ-neighbors
at each step. To calculate the expected number of evaluations for an uphill step
of AA, it is necessary to determine the number of uphill φ-neighbors that exist
at that vertex. If we let n be the number of φ-neighbors and u be the number of
uphill φ-neighbors, then the expected number of evaluations before the vertex
is left is given by

    E(\text{evaluations}) = \sum_{i=1}^{\infty} i \left(\frac{n-u}{n}\right)^{i-1} \frac{u}{n}
                          = \frac{u}{n} \cdot \frac{1}{\left(1 - \frac{n-u}{n}\right)^{2}}
                          = \frac{u}{n} \cdot \frac{1}{(u/n)^{2}}
                          = \frac{n}{u}.

The second line of this derivation follows from the equation obtained by differ-
entiating the identity \frac{1}{1-x} = 1 + x + x^{2} + x^{3} + \cdots.
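As an incidental check (not part of the original derivation), the result E(evaluations) = n/u is easy to confirm by simulation, since the number of neighbors examined before an uphill one is found is geometrically distributed with success probability u/n; a small Python sketch with illustrative names follows.

import random

def expected_evaluations(n, u, trials=100000):
    """Average number of uniform draws of a neighbor until one of the u uphill
    neighbors (out of n) is hit; the average should be close to n / u."""
    total = 0
    for _ in range(trials):
        draws = 1
        while random.random() >= u / n:   # each draw is uphill with probability u/n
            draws += 1
        total += draws
    return total / trials

# For instance, expected_evaluations(16, 4) should be close to 16 / 4 = 4.0.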
This list is merely a brief overview of how further details of the basin of attraction
and its vertices may be obtained. It should provide enough information for the
algorithm to be implemented. Further discussion of implementation is provided in
Appendix A.
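As an illustration of the idea, the following Python sketch explores a basin of attraction under steepest ascent, where every vertex has a single uphill successor, so a set suffices in place of the hash table of per-vertex statistics described above. The function names, the restriction to steepest ascent and the assumption of no fitness ties are simplifications of the general algorithm.

def fittest_neighbor(w, fitness, neighbors):
    """The vertex a steepest ascent hillclimber would move to from w."""
    return max(neighbors(w), key=fitness)

def basin_of_attraction(peak, fitness, neighbors):
    """Reverse hillclimbing under steepest ascent: return the set of vertices
    from which a steepest ascent climb reaches peak.  Assumes no fitness ties,
    so every vertex has a unique uphill successor."""
    basin = set()
    stack = [peak]
    while stack:
        v = stack.pop()
        if v in basin:
            continue
        basin.add(v)
        for w in neighbors(v):
            # w ascends directly to v iff w is less fit than v and v is the
            # fittest of w's neighbors; recurse downhill from such vertices.
            if fitness(w) < fitness(v) and fittest_neighbor(w, fitness, neighbors) == v:
                stack.append(w)
    return basin

With the bit_flip_neighbors helper from the earlier hillclimbing sketch and any fitness function over bit strings, len(basin_of_attraction(peak, fitness, bit_flip_neighbors)) gives the exact steepest ascent basin size of peak.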
The reverse hillclimbing algorithm has the pleasing property of becoming more
effective as the landscapes on which it is set to work become more difficult to search.
As problems get harder and bigger, it may be that reverse hillclimbing can only
practically be employed to find details of the hardest to locate areas of these hardest
problems. Naturally, the algorithm does not help to identify these areas before setting
to work on exploring them.
4.7.1. NK Landscapes
The NK landscapes were introduced by Kauffman [117, 118] as a simplistic model of
systems with many interacting parts, particularly the complex epistatic interactions
found in genomic systems. An NK landscape has two important parameters: N and
K. N is the number of dimensions in the landscape, each dimension having some
fixed number of possible values. Attention has been focused almost exclusively on
binary dimensions, and this is also the approach of this chapter. A vertex in an
NK landscape can therefore be represented by a binary string of length N. The N
positions are often referred to as loci.
The fitness of a vertex in an NK landscape is computed in an additive fashion.
Each of the N loci makes a contribution to the overall fitness. These are then summed
and divided by N to obtain the fitness of the vertex. The variable K controls the
degree to which the fitness contribution of each locus is affected by the binary values
found at other loci. K is the average number of other loci that affect the fitness
contribution of each locus. If K = 0, the fitness contribution of each locus is indepen-
dent of every other locus, and the optimal value for each locus can be determined by
simply comparing the overall fitness of two points: one with the locus set to 0 and
the other with the locus set to 1. If K = N − 1, every locus is affected by every other
locus and changing the value assigned to a locus also changes the fitness contribution
of every other locus.
A simplistic way to implement an NK landscape requires filling an N by 2^{K+1}
array with real numbers chosen uniformly at random from the interval [0.0, 1.0]. A
simple method of choosing influencing loci is to consider the N loci to form a ring, and
let the ⌊K/2⌋ loci to the left and ⌈K/2⌉ loci to the right of each locus influence that
locus. An NK landscape with N = 4 and K = 2 and an example fitness calculation
is shown in Figure 50. Neighborhood in an NK landscape is defined by the operator
defined in §2.6.1(31). In general, one cannot store N · 2^{K+1} random numbers and
so this simplistic method must be replaced with something more sophisticated, and
there are several ways in which this can be done. This presentation of NK landscapes
is deliberately very brief, as more detailed presentations are easily available [117, 118].
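A rough Python sketch of this simplistic table-based implementation, assuming binary loci and the ring neighborhood just described, is given below; the names are illustrative, not Kauffman's. For the N = 4, K = 2 example of Figure 50, locus 0 of the string 0110 is looked up under the neighborhood pattern 001, as in the figure.

import random

def make_nk_tables(n, k, seed=None):
    """One table per locus with 2**(k+1) uniform random entries, indexed by the
    combined bits of the locus and its k influencers."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(2 ** (k + 1))] for _ in range(n)]

def nk_fitness(bits, tables, k):
    """NK fitness: the average of the per-locus contributions.  Locus i is
    influenced by the floor(k/2) loci to its left and the ceil(k/2) loci to its
    right on a ring, as described above."""
    n = len(bits)
    left, right = k // 2, k - k // 2
    total = 0.0
    for i in range(n):
        # The locus's own bit and its influencers, left to right, form the index.
        group = [bits[(i + d) % n] for d in range(-left, right + 1)]
        index = int("".join(map(str, group)), 2)
        total += tables[i][index]
    return total / n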
If N is fixed and the value of K is allowed to vary, the set of landscapes that
are produced will range from almost certainly unimodal (containing a single global
maximum) for K = 0, to completely random (meaning that the fitness of a vertex
is completely uncorrelated with that of any of its N one-mutant neighbors) for K =
N − 1. The NK landscapes are thus said to be tunably rugged. Because there are
aspects of some NK landscapes that are amenable to analytical investigation and
because there are other aspects of these landscapes that have been studied empirically
[117, 118], these landscapes provide a good testbed for reverse hillclimbing studies.
Figure 50. An NK landscape with N = 4 and K = 2, and an example fitness calculation for the string 0110. Each locus is influenced by the locus to its left and the locus to its right on the ring.

Neighborhood   Locus 0   Locus 1   Locus 2   Locus 3
001            .42       .77       .35       .11
010            .68       .31       .04       .89
011            .91       .17       .25       .70
100            .93       .12       .73       .53
101            .59       .64       .82       .77
110            .19       .94       .38       .21

Fitness calculation for string 0110:
Locus   Neighborhood   Fitness
0       001            .42
1       011            .17
2       110            .38
3       100            .53
Fitness = (.42 + .17 + .38 + .53) / 4
this chapter investigate the relative performance of the various hillclimbing algorithms
on a set of peaks. In these experiments, the aim is to provide an answer to the
question: Which hillclimbing algorithm can be expected to locate a peak using the
fewest function evaluations? The experiment considers the randomly located peaks
and also sets of high-fitness peaks. On the studied instances of the busy beaver
problem, the optimal peaks are all known, and the performance of the hillclimbers
is compared given the task of locating any one of the optimal peaks. For the NK
landscapes, the best five percent of the randomly located peaks are separated, and
the hillclimbers are compared when their task is to find any one of these peaks. The
reverse hillclimbing algorithm is used to make predictions about the performance of
the four hillclimbers and experiments show these to be very accurate.
The reverse hillclimbing analysis helps to throw light on an important factor
in hillclimbing algorithms. This is the question of how much time should be spent
looking for uphill neighbors before moving. Reverse hillclimbing makes it possible to
see that, on the problems considered, hillclimbers that choose higher neighbors have
higher probabilities of reaching higher peaks. Choosing a higher neighbor incurs a
cost however. The hillclimber must examine more neighbors. Finding a good balance
between these two is important to overall performance. AA and SA represent quite
different responses to this tradeoff. AA performs few function evaluations to find
any uphill neighbor whereas SA examines all neighbors to guarantee that one with
maximal fitness is selected. Several new hillclimbers that explore this tradeoff are
examined in Appendix B.
In the experiments of this chapter, three NK landscapes were investigated. In
each of these, the value of N was 16. K took on values of 4, 8 and 12. The 2-, 3- and
4-state busy beaver problems were also examined.
4.9. Results
4.9.1. NK Landscapes
For each of the three NK landscapes examined, a number of peaks were identified (via
hillclimbing). These were then subject to reverse hillclimbing using each of the four
hillclimbing algorithms. Then, for each of these landscapes, the fittest five percent of
the peaks were examined separately to see if one of the algorithms appeared better
at finding higher peaks.
Table 7 shows the result of reverse hillclimbs from 1000 peaks on an N = 16,
K = 12 landscape. The peaks were located by performing hillclimbs from randomly
chosen starting points. The four columns of the table represent the four algorithms.
The first row of data shows the number of different basin sizes that were found. For
example, of the 1000 different peaks whose basins were explored for LA, there were
only 176 different basin sizes found. The second, third, fourth and fifth lines show
the minimum, maximum, mean and standard deviation of the basin sizes that were
found. The second last line shows the overall probability that a hillclimb started from
a randomly chosen location will reach one of the peaks from the set of 1000. The
final row is the expected number of climbs that it would take for this to occur. This
row is the reciprocal of the previous row.
There are several things that can be noted about this table:
AA has basins that are far greater in size than those of any other algorithm.
The smallest basin found under AA is approximately six times bigger than the
maximum basin size under any other algorithm. That basins under AA are
bigger than the basins under the other algorithms is not surprising. If one
orders the hillclimbers in terms of decreasing out-degree of landscape vertices,
the ranking would be AA first, MA second and LA and SA roughly equal last.
The same ranking can be observed in mean (and maximum) basin size. Note
that the ranking of algorithms by landscape out-degree would change if N were
odd instead of even, since in that case MA would find a single median instead
of having to make a random choice between two at every step of its climb.
AA also produces far more unique basin sizes, which is not surprising given the
far greater range of these numbers. Of the 1000 basins that were explored, there
were 962 different basin sizes found. What seems initially more remarkable is the
low numbers for the other algorithms, especially for LA and SA. This proves to
have an uninteresting explanation however, which is that there are many small
basins for each of these algorithms. If a hillclimber tends to have small basins,
there will tend to be fewer different basin sizes in a sample. This is confirmed
by looking at the actual basin sizes (not shown). Further evidence that this is
the cause is seen by noting the agreement between the maximum basin sizes
and the number of different basin sizes. If we rank the four algorithms by
decreasing maximum basin size, and also by decreasing number of basin sizes,
we get identical orderings.
The 1 entries for minimum basin size under LA and MA are eye-catching. If each
of the vertices from which reverse hillclimbing is performed in this experiment
is supposed to be a peak, how can any of them have a basin size of only one
(i.e., the peak and no other vertices)? This seemed at first suspicious but has
a simple solution. These vertices are indeed peaks, which is to say that all
their neighbors are less fit than they are. But their basins also have size one,
because each of the neighbors does not ascend to the peak. This cannot happen
under AA since there is always a non-zero probability of an ascent to every
fitter neighbor. It is possible in the other three algorithms. It did not happen
with SA in this set of peaks, but that will be seen in later experiments.
These peaks are something of a curiosity, and are without a doubt the hardest
peaks to locate with these algorithms, since their basins of attraction contain
only the peaks themselves. To find such a peak with a randomly started hill-
climb, the peak itself must be chosen as the starting vertex. As a result, LA,
MA and SA cannot do better than random search when attempting to locate
such a vertex.
The overall probability of locating a peak shows an interesting pattern. If we
ignore AA for the time being, there is a direct correspondence between how
exploitative LA, MA and SA are and how often they expect to find one of the
set of peaks. In saying that one algorithm is more exploitative than another, it
is meant that if all the fitter neighbors of a vertex were discovered and ranked
from low to high according to fitness, the more exploitative algorithm would
choose to move to one whose rank was higher. SA is more exploitative than
MA which in turn is more exploitative than LA. Correspondingly, SA has a
higher probability of finding one of the set of peaks than MA which has a higher
probability than LA.
Since AA is on average as exploitative as MA (though with far greater variance),
it seems reasonable to expect that its probability of discovery lies between that
of SA and LA, and closer to MA than to either of the extremes. This is in fact the
case. These general patterns of discovery probabilities will persist throughout
the experiments in this chapter. At this stage, it should be remembered that
these results apply to a set of peaks that were located via randomly started
hillclimbs. We have not seen if the pattern persists when we ask what the
probabilities are of locating good peaks.
As the 1000 peaks whose basins were calculated were found via randomly started
hillclimbs, there is nothing special about them and the fact that one algorithm can
find one of them with a higher probability than another is not particularly interesting.
More interesting is to consider the best of these peaks to see if any algorithm appears
to have an advantage at finding higher peaks. There does not seem to be any a
priori reason to expect such a bias, but it is worth looking for. If this bias is not
found, it is a good indication that the performance of the hillclimbing algorithms may
be accurately estimated via observing their performance on peaks located through
random sampling.
Table 8 is identical in form to Table 7, except the figures displayed show only
the data for the best five percent (50) of the original 1000 peaks. Points of interest
in Table 8 include the following:
The number of different basin sizes has leveled out across the algorithms. This
is to be expected, since we are now sampling the fittest peaks, and it is not
unreasonable to expect these to have bigger basins and thus, probably more
variance in their basin sizes.
Interestingly, the 1 entries for LA and MA have persisted, when they might have
been expected to disappear as it seems reasonable to expect that they would
correspond to particularly poor peaks. This is definitely not the case. Further
investigation reveals that 9 of the top 50 peaks have basin sizes of 1 under LA.
The neighbors of a very fit vertex, v, are likely to have other uphill neighbors
that are less fit than v, and in this case, LA will not move to v from any of its
neighbors. MA exhibits the same qualitative behavior, but its effect is greatly
reduced (only 1 of the top 50 peaks had a basin of size 1 under MA). Under
LA, 15 of the top 50 peaks had a basin of size less than 5, whereas under MA
Table 8. Reverse hillclimbing from the best 50 of 1000 random peaks in a 16,12 NK
landscape.

                     AA        LA      MA      SA
# of basin sizes     50        30      48      43
Min. basin size      19,242    1       1       28
Max. basin size      35,130    385     1230    287
Mean basin size      29,243    45      313     102
Basin size s.d.      3489      67      241     59
P(Discovery)         0.062     0.034   0.053   0.078
E(Climbs)            16.18     29.39   18.80   12.88
only 2 did. It is not hard to see that the global maximum of these functions
will always be such a peak under LA and MA! Clearly, LA and MA will not be
much help in locating this vertex.
The pattern of ascent probabilities described above has also persisted. Again,
AA comes out slightly ahead of MA.
Table 9. The result of 10,000 hillclimbs on the 16,12 NK landscape. The table
shows data for the number of uphill steps taken for each algorithm and the number
of evaluations performed per uphill walk.

                    AA      LA      MA      SA
Av. steps/climb     3.455   9.335   3.836   1.999
Min. steps/climb    0       0       0       0
Max. steps/climb    13      33      12      8
Steps/climb s.d.    1.878   5.919   1.857   1.064
Av. evals/climb     31.22   166.4   78.37   48.99
Min. evals/climb    17      17      17      17
Max. evals/climb    90      545     209     145
Evals/climb s.d.    10.63   94.70   29.66   17.02
AA performs significantly fewer evaluations per climb than the other algorithms, all of which examine
all neighbors before moving. The reason LA, MA and SA do not perform similar
numbers of evaluations is that they go uphill at different rates, as can be seen from
the mean steps per climb in the table. SA has a mean of about 2 steps before reaching
a peak, while LA takes about 9.
The most interesting question that arises is whether AA will be better than SA or
vice-versa. SA has a lower expected number of climbs to locate a peak, but takes more
evaluations to do so. The answer to this can be found by multiplying the expected
number of climbs from Table 8 by the expected number of evaluations from Table 9.
This produces Table 10.
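For example, taking the AA column, an expected 16.18 climbs to find one of the peaks, at an average of 31.22 evaluations per climb, gives 16.18 × 31.22 ≈ 505 expected evaluations per peak found, which is the corresponding entry of Table 10.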
The table compares the expected performance of each of the hillclimbers with
their observed performance. SA located more peaks in the 10,000 runs but, more
importantly, AA took fewer evaluations on average to locate a peak. Since we are
using algorithm evaluation count as our yardstick for algorithm performance, AA is
Table 10. Expected and observed numbers of peaks and evaluations for 50 good NK
16,12 peaks.

                   AA     LA      MA      SA
E(Peaks found)     617    340     532     777
Peaks found        620    348     578     736
E(Evals/peak)      505    4889    1473    631
Evals/peak         503    4780    1355    665
the best algorithm for finding this particular set of peaks on this particular landscape,
assuming the sampling that gave the expected evaluations per climb data was not
wildly wrong. SA is a clear second, with MA and LA third and fourth. These results
could easily be made more accurate by increasing the number of hillclimbs done to
obtain the data on expected number of evaluations per climb. For our purposes, a
small number of hillclimbs to estimate evaluations per climb was sufficient to obtain
estimates that closely matched the actual performance of the algorithms.
Of course, the behavior of the different algorithms may simply be due to chance,
and may not apply to other NK landscapes, let alone other problems. To investigate
this, and also to examine landscapes of differing ruggedness, the same experiment
was performed on an N = 16, K = 8 and an N = 16, K = 4 NK landscape.
Tables 11 to 14 are the N = 16, K = 8 counterparts of Tables 7 to 10 for the N = 16,
K = 12 landscape. Table 11 shows the result of performing reverse hillclimbing from
700 peaks in an N = 16, K = 8 landscape. Fewer peaks are descended from in this
experiment as the landscape contains fewer peaks. Judging from the figures for SA
in the table, the 700 peaks and their basins account for approximately 96% of the
vertices in the landscape. This follows from the fact that SA is with high probability
completely deterministic on an NK landscape (since it is very unlikely that any two
vertices will have identical fitness), so multiplying the mean basin size (90) by the
number of basins (700) gives a good estimate of the number of vertices represented by
the sample. In this case the basins represent approximately 63,000 of the 2^16 vertices.
The 2500 or so vertices that are not represented should (if the mean basin size can be
used reliably) fall into approximately 28 basins. We would expect only one of these
to have fitness in the highest 5% of all basins, so we can be reasonably confident that
Table 11. Reverse hillclimbing from 700 random peaks in a 16,8 NK landscape.

                     AA        LA      MA      SA
# of basin sizes     687       233     526     230
Min. basin size      11,179    1       1       1
Max. basin size      49,275    975     4399    973
Mean basin size      34,812    88      583     90
Basin size s.d.      6613      145     659     109
P(Discovery)         0.958     0.945   0.958   0.966
E(Climbs)            1.044     1.059   1.044   1.035
the top 5% of our sample of 700 is very close to being the top 5% of peaks in the
entire landscape.
The patterns that were evident in the N = 16, K = 12 landscape can also be
seen on this landscape. Once again, SA has a higher probability of locating a peak
than MA does and it in turn has a higher probability than LA. AA has a probability
that is very close to that of MA (in the table they are identical as only two decimal
places are shown). The basin sizes also show patterns that are the same as those on
the more rugged landscape. AA has by far the biggest basins, then MA and then
SA just barely ahead of LA. Table 12 shows the reverse hillclimbing results when we
restrict attention to the top 5% (35) of the 700 peaks. Again, looking at just the best
peaks in the sample does not alter the relative algorithm performances.
Table 13 shows the result of 10,000 hillclimbs on this landscape. Once again it
is clear that one of SA and AA will prove to be the best algorithm on this landscape.
SA has a better chance of finding a peak on a given climb, but AA takes fewer
comparisons per climb. The expected and observed data for the number of peaks
discovered and the number of evaluations per discovery is presented in Table 14 and
the balance once more favors AA. SA's dominance in probability of discovery does
not match AA's dominance in evaluations per climb.
This story is repeated in every detail on the N = 16, K = 4 landscape. In
this case, all 180 peaks in the landscape were identified. The reverse hillclimbs from
Table 12. Reverse hillclimbing from the best 35 of 700 random peaks in a 16,8 NK
landscape.

                     AA        LA      MA       SA
# of basin sizes     34        31      35       32
Min. basin size      35,751    1       618      73
Max. basin size      48,486    975     14,399   829
Mean basin size      43,704    191     2019     349
Basin size s.d.      3393      236     1044     145
P(Discovery)         0.159     0.102   0.156    0.186
E(Climbs)            6.294     9.809   6.411    5.372
these peaks are shown in Table 15. The best 5% (9) are examined in Table 16,
10,000 hillclimbs were done to gather expected uphill path lengths (Table 17), and
the summary results appear in Table 18.
In this case, the basins of attraction of the peaks account for all the vertices in
the landscape. As we have seen virtually identical behavior on these three landscapes,
it seems reasonable to assume that, although we only considered about 70% of the
space in the N = 16, K = 12 landscape, the results would not have been greatly
different had we examined additional peaks.
Table 13. The result of 10,000 hillclimbs on the 16,8 NK landscape. The table
shows data for the number of uphill steps taken for each algorithm and the number
of evaluations performed per uphill walk.

                    AA      LA      MA      SA
Av. steps/climb     4.632   12.62   5.092   2.833
Min. steps/climb    0       0       0       0
Max. steps/climb    14      38      13      10
Steps/climb s.d.    2.257   7.062   2.244   1.317
Av. evals/climb     36.37   218.8   98.47   62.33
Min. evals/climb    17      17      17      17
Max. evals/climb    103     625     225     177
Evals/climb s.d.    12.50   112.9   35.91   21.07
points in the basins of all the peaks under SA is a set of 777,000 points. In an overall
space of 25,600,000,000 points, this represents only just over 3 × 10^-5 of the entire
space.
Table 20 shows the result of reverse hillclimbing from the 48 known optimal TMs
for the 4-state problem. The small number of different basin sizes for the algorithms
is the result of redundancies in the representation. Once again, the four algorithms
exhibit an ordering, by probability of reaching any peak on a single hillclimb, that
should by now be becoming familiar. SA is a clear winner, AA is second, followed by
MA and then LA. This pattern was also seen in the best five percent of peaks for the
three NK landscapes (see Tables 8 to 16). In all three of those cases, SA ranked first
and AA second. In two of them, MA was third and LA last and in one, LA third and
MA last. This pattern will be seen in the three busy beaver landscapes. In fact, the
pattern in the busy beaver landscapes is exactly that seen in the NK landscapes. SA
is first in all three, AA is second in all three and MA beats LA on all but the simplest
of the problems.
To examine whether the reverse hillclimbing results are accurate and to determine
which algorithm can be expected to find peaks using the fewest evaluations, it is
Table 14. Expected and observed numbers of peaks and evaluations for 35 good NK
16,8 peaks.

                   AA     LA      MA      SA
E(Peaks found)     1589   1019    1560    1862
Peaks found        1451   1042    1458    1854
E(Evals/peak)      229    2146    631     335
Evals/peak         251    2100    675     336
Table 15. Reverse hillclimbing from the 180 peaks in a 16,4 NK landscape.

                     AA        LA      MA       SA
# of basin sizes     179       140     178      158
Min. basin size      14,703    1       72       6
Max. basin size      55,146    5029    14,207   2565
Mean basin size      43,640    364     2761     364
Basin size s.d.      7944      588     2738     446
P(Discovery)         1.0       1.0     1.0      1.0
E(Climbs)            1         1       1        1
Table 16. Reverse hillclimbing from the best 9 of the 180 peaks in a 16,4 NK land-
scape.

                     AA        LA      MA       SA
# of basin sizes     9         9       9        9
Min. basin size      42,817    69      4566     644
Max. basin size      54,927    5029    14,207   2485
Mean basin size      51,768    1161    7637     1444
Basin size s.d.      3723      1470    3499     634
P(Discovery)         0.165     0.160   0.151    0.198
E(Climbs)            6.076     6.269   6.618    5.042
Table 17. The result of 10,000 hillclimbs on the 16,4 NK landscape. The table
shows data for the number of uphill steps taken for each algorithm and the number
of evaluations performed per uphill walk.

                    AA      LA      MA      SA
Av. steps/climb     6.064   15.20   6.622   3.886
Min. steps/climb    0       0       0       0
Max. steps/climb    19      44      19      10
Steps/climb s.d.    2.587   3.373   2.695   1.484
Av. evals/climb     41.58   260.2   122.9   79.18
Min. evals/climb    17      17      17      17
Max. evals/climb    126     721     321     177
Evals/climb s.d.    13.57   117.9   43.12   23.74
Table 18. Expected and observed numbers of peaks and evaluations for 9 good NK
16,4 peaks.

                   AA     LA      MA      SA
E(Peaks found)     1645   1595    1511    1983
Peaks found        1530   1592    1545    2017
E(Evals/peak)      252    1631    813     399
Evals/peak         272    1634    795     393
Table 19. Reverse hillclimbing from 1000 random 4-state busy beaver peaks.
AA LA MA SA
# of basin sizes 470 440 320 356
Min. basin size 4 4 1 2
Max. basin size 545,848 360,864 42,970 61,606
Mean basin size 4305 2585 481 777
Basin size s.d. 27,647 16,521 2628 4282
P(Discovery) 0.0000075 0.0000066 0.0000071 0.0000085
E(Climbs) 133,754 151,713 141,453 117,993
4.10. Conclusion
This chapter introduced the reverse hillclimbing algorithm and showed that it makes
it possible to determine the relative performance of two hillclimbing algorithms (that
only take uphill steps) on a given problem. The algorithm also provides answers to
several other questions concerning sizes of basins of attraction and lengths of paths
to discover peaks. To answer these questions without reverse hillclimbing, it appears
necessary to use statistical methods based on the occasional discovery of peaks. These
methods are both computationally infeasible and of limited accuracy. Reverse
hillclimbing, when it is useful, provides answers to the above questions very rapidly
and with a high degree of precision. In many cases answers obtained with the method
are exact.
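To make the procedure concrete, the following is a minimal sketch (not the dissertation's implementation) of reverse hillclimbing for a deterministic steepest ascent climber on binary strings under the single bit-flip operator. Starting from a known peak, it repeatedly asks which neighbours would have stepped to a vertex already known to be in the basin, and so enumerates the basin exactly. All names are illustrative.

    from collections import deque

    def neighbours(x):
        """All strings one bit-flip away from the tuple of bits x."""
        for i in range(len(x)):
            yield x[:i] + (1 - x[i],) + x[i + 1:]

    def steepest_ascent_choice(x, fitness):
        """The neighbour a deterministic steepest ascent climber moves to from x,
        or None if x is a local optimum. Ties go to the earliest such neighbour."""
        best, best_f = None, fitness(x)
        for y in neighbours(x):
            fy = fitness(y)
            if fy > best_f:
                best, best_f = y, fy
        return best

    def reverse_hillclimb(peak, fitness):
        """Enumerate the basin of attraction of `peak` under steepest ascent.
        Works downhill: a vertex y joins the basin if the climber would move
        from y to a vertex already known to be in the basin."""
        basin = {peak}
        frontier = deque([peak])
        while frontier:
            x = frontier.popleft()
            for y in neighbours(x):
                if y not in basin and steepest_ascent_choice(y, fitness) == x:
                    basin.add(y)
                    frontier.append(y)
        return basin

    if __name__ == "__main__":
        # Toy example: one max, whose single peak is the all-ones string.
        n = 10
        one_max = lambda x: sum(x)
        basin = reverse_hillclimb(tuple([1] * n), one_max)
        print(len(basin))  # every vertex climbs to the all-ones string: 2**10 = 1024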
The reverse hillclimbing algorithm cannot be used to discover high peaks on
a fitness landscape. Rather, it is used as the basis for analysis once these peaks
have been found by other methods. For instance, if a certain hillclimbing method
has identified 100 peaks on a "rugged" landscape, reverse hillclimbing can be used
to calculate the probability that other hillclimbers would find these peaks. If some
other search algorithm has identified a good peak, reverse hillclimbing can produce
the exact probability that a given form of hillclimbing would have found the point.
This allows rapid and accurate comparisons without the need to run the hillclimber
Table 20. Reverse hillclimbing from the 48 peaks in the 4-state busy beaver problem.
AA LA MA SA
# of basin sizes 6 5 24 4
Min. basin size 10,420 3349 897 1426
Max. basin size 25,714 5648 2660 5364
Mean basin size 18,072 4511 1552 3370
Basin size s.d. 7637 1133 669 1936
P(Discovery) 0.000001169 0.000000610 0.000001002 0.000002012
E(Climbs) 855,538 1,639,043 997,524 497,129
to form an estimate of the probability. In previous work [110], this approach was used
to assess the performance of a GA.
Reverse hillclimbing has the pleasing property of becoming more feasible as the
landscape on which it is run gets more "rugged." That is, as basin sizes decrease
(perhaps due to increased numbers of peaks, as we saw above with NK landscapes)
reverse hillclimbing becomes increasingly competitive with sampling methods based
on hillclimbing. On extremely difficult problems, reverse hillclimbing will perform
extremely well since it never strays more than one vertex from the points in the basin
of attraction during its descent. Therefore, the smaller the basin of attraction of
a peak, the faster reverse hillclimbing will be and the more difficult it becomes to
answer questions about the basin using other methods.
Other statistics about landscapes can also be easily investigated via reverse
hillclimbing. These include the relationships between peak heights, basin sizes and
probability of discovery. A trivial modification to the algorithm can be used to explore
the "basin" of points above a certain point in a landscape. This allows calculation of
statistics relating the number of uphill directions to the number of peaks that can be
reached from a given point. This information might be used to direct the behavior
of a hillclimber that expended variable effort in looking for uphill neighbors at each
stage of the climb.
Table 21. The result of 10,000 hillclimbs on the 4-state busy beaver landscape. The
table shows data for the number of uphill steps taken for each algorithm and the
number of evaluations performed per uphill walk.
AA LA MA SA
Av. steps/climb 2.091 2.242 2.111 1.883
Min. steps/climb 0 0 0 0
Max. steps/climb 8 9 8 8
Steps/climb s.d. 1.316 1.461 1.307 1.165
Av. evals/climb 83.35 156.6 150.3 139.4
Min. evals/climb 49 49 49 49
Max. evals/climb 200 481 433 433
Evals/climb s.d. 25.82 70.14 62.73 55.92
Finally, reverse hillclimbing has highlighted the tradeoff between moving uphill
quickly and moving in the steepest direction possible. The any ascent algorithm takes
the first uphill direction it can find and consequently expends little energy looking for
uphill directions. Steepest ascent always examines all neighbors and then chooses the
steepest of these. Although, in our experience, steepest ascent always has a higher
probability of finding a peak on a single climb (this can be shown exactly by reverse
hillclimbing for a given problem), this advantage is usually not enough to compensate
for the extra work it must do at each step. These two approaches represent extremes
of behavior. Appendix B presents preliminary results of an attempt to find more
balanced hillclimbers.
Table 22. Expected and observed numbers of peaks and evaluations for the 48 peaks
on the 4-state busy beaver problem. The observed values are taken from 20 million
hillclimbs with each of the four algorithms.
AA LA MA SA
E(Peaks found) 23 12 20 40
Peaks found 23 15 23 44
E(Evals/peak) 71,307,723 256,657,743 149,960,575 69,294,214
Evals/peak 71,270,514 208,393,004 130,625,970 63,161,171
Table 23. Reverse hillclimbing from 1000 random 3-state busy beaver peaks.
AA LA MA SA
# of basin sizes 287 279 212 224
Min. basin size 4 4 2 2
Max. basin size 20,219 13,616 7052 8142
Mean basin size 327 222 108 137
Basin size s.d. 1309 869 409 502
P(Discovery) 0.00368 0.00329 0.00364 0.00407
E(Climbs) 272 304 275 246
Table 24. Reverse hillclimbing from the 40 peaks in the 3-state busy beaver problem.
AA LA MA SA
# of basin sizes 13 14 22 14
Min. basin size 5659 2522 1247 2092
Max. basin size 20,219 13,616 7052 8142
Mean basin size 10,974 6702 3249 4156
Basin size s.d. 6013 4746 2183 2426
P(Discovery) 0.00399 0.00337 0.00381 0.00453
E(Climbs) 251 297 262 221
Table 25. The result of 10,000 hillclimbs on the 3-state busy beaver landscape. The
table shows data for the number of uphill steps taken for each algorithm and the
number of evaluations performed per uphill walk.
AA LA MA SA
Av. steps/climb 1.754 1.829 1.762 1.591
Min. steps/climb 0 0 0 0
Max. steps/climb 7 7 7 6
Steps/climb s.d. 1.114 1.189 1.116 1.005
Av. evals/climb 50.65 85.87 83.86 78.72
Min. evals/climb 31 31 31 31
Max. evals/climb 136 241 241 211
Evals/climb s.d. 15.64 35.68 33.47 30.14
Table 26. Expected and observed numbers of peaks and evaluations for the 40 peaks
on the 3-state busy beaver problem. The observed values are taken from 10,000
hillclimbs with each of the four algorithms.
AA LA MA SA
E(Peaks found) 40 34 38 45
Peaks found 49 29 37 48
E(Evals/peak) 12,687 25,463 21,990 17,367
Evals/peak 12,662 29,609 22,665 16,400
Table 27. Reverse hillclimbing from 1000 random 2-state busy beaver peaks.
AA LA MA SA
# of basin sizes 37 35 27 27
Min. basin size 4 4 4 3
Max. basin size 372 292 200 274
Mean basin size 144 121 110 113
Basin size s.d. 264 207 142 190
P(Discovery) 0.409 0.393 0.406 0.425
E(Climbs) 2.448 2.545 2.460 2.355
Table 28. Reverse hillclimbing from the 4 peaks in the 2-state busy beaver problem.
AA LA MA SA
# of basin sizes 2 2 2 2
Min. basin size 370 290 198 273
Max. basin size 372 292 200 274
Mean basin size 371 291 199 273.5
Basin size s.d. 1 1 1 0.5
P(Discovery) 0.0354 0.0321 0.0304 0.0390
E(Climbs) 28.27 31.17 32.85 25.64
Table 29. The result of 10,000 hillclimbs on the 2-state busy beaver landscape. The
table shows data for the number of uphill steps taken for each algorithm and the
number of evaluations performed per uphill walk.
AA LA MA SA
Av. steps/climb 1.372 1.446 1.374 1.305
Min. steps/climb 0 0 0 0
Max. steps/climb 5 5 5 5
Steps/climb s.d. 0.917 0.980 0.911 0.838
Av. evals/climb 26.26 40.14 38.99 37.87
Min. evals/climb 17 17 17 17
Max. evals/climb 76 97 97 97
Evals/climb s.d. 7.898 15.68 14.57 13.41
Table 30. Expected and observed numbers of peaks and evaluations for the 4 peaks
on the 2-state busy beaver problem.
AA LA MA SA
E(Peaks found) 354 321 304 390
Peaks found 356 317 290 364
E(Evals/peak) 742 1251 1281 971
Evals/peak 737 1266 1344 1040
CHAPTER 5
Figure 51. The 15-puzzle. The object is to rearrange the fifteen numbered tiles to
form an increasing sequence from top left to bottom right. Tiles are rearranged via
the movement of a numbered tile into the vacant (dark) location.
imagine the entire set of possible arrangements (states) in a connected fashion such
as that shown in Figure 52. Puzzle configurations are connected through the action
of sliding a tile into the unoccupied location. This is known as a state space and the
problem can be viewed as that of finding a path through this network to the unique
solution state.
Figure 52. A portion of the state space for the 15-puzzle. Sliding a tile into the
unoccupied location transforms one state into another.
operators in the natural state space. Rubik's Cube is useful for illustrating this choice.
The natural state space representation for the cube has each configuration of the
cube correspond to a state in the state space graph, with edges emanating from a
vertex given by the possible twists of the cube's sides. A second choice is to construct
a state space with only three elements, one being the set of all cube configurations
that have one layer complete, the second the set of all cube configurations with two
layers complete and the third the completed cube. This state space would be very
easy to search if we could find the operators that connected the states. In a third
view, used by Korf [33], the individual "cubies" and the possible cubie positions are
labeled from 1 to 20. State i:j (for 1 ≤ i ≤ 20 and i < j ≤ 20) indicates that cubies
numbered less than or equal to i are all positioned correctly and that cubie i + 1 is
in position j. When i = 20 the puzzle is solved and the value of j is irrelevant. In
this state space, each state corresponds to a set of cube configurations. This state
space also has the advantage of being trivial to navigate in. Fortunately, it is possible
(though by no means easy) to find operators that connect successive states [33]. Korf
calls the sequences of cube twists that make up the operators in this state space
macro-operators and attributes the idea to Amarel [32]. The problem of finding such
operators is a challenging search problem in itself. Korf's preference is to continue to
regard the natural state space as the state space under investigation, hence the name
macro-operator. Mine is to treat sets of states of the natural state space as states
in a new state space and the macro-operators as operators in the new state space.
The relation "in the same Korf i:j class as" is an equivalence relation over the set
of all cube configurations. Each of the equivalence classes defined by this relation
can be regarded as a state in a new state space. Allowing a state in a state space
to correspond to a set of states from another state space is merely an example of
allowing a landscape vertex to correspond to a multiset of elements of R.
These choices are all possible approaches to solving Rubik's cube by viewing the
search as taking place in a state space. The first state space is simple to construct
and hard to search and the second and third are hard to construct and simple to
search. This tradeoff is exactly the tradeoff that we encountered in §2.11 (page 41) when
discussing operators and representation in the context of landscapes. The issues that
we face when constructing a state space and operators to move between states are
the same as those that we face when choosing an object space (and, subsequently, a
representation space) and operators when constructing a landscape. In both cases we
need to decide what we will focus on as possible solutions to the problem and how we
will transform these possible solutions into others. In both cases, the results of these
choices can be viewed as a graph in which the search will take place.
1. A global database,
2. A set of production rules, and
3. A control system.
The rules consist of preconditions and actions. The preconditions may match
some aspect of the global database, in which case the rule may be applied to
the database, thereby changing it. The control system determines which rules
it is appropriate to invoke, resolves conflicts (when the preconditions of many
rules are satisfied) and brings the search to a halt when the global database
satisfies some stopping criterion. This description may seem to have little to
do with evolutionary algorithms. Actually, it has everything to do with these
algorithms. If one substitutes C (the collection of elements of the representation
space in the landscape model), operators and navigation strategy for global
database, production rules and control system in the above, the description fits
the landscape model almost perfectly.
the fields rather than an indication that the correspondence is weak. For example,
AI and OR algorithms commonly operate on partial individuals but this is not the
case in evolutionary algorithms. Rather than challenging the correspondence between
the fields, one may be tempted to devise evolutionary algorithms that explicitly
manipulate partial individuals. As another example, the values attached to vertices by
an heuristic function in AI are often interpreted as a distance to a goal, but this is
never done in evolutionary algorithms. The result of viewing the fitness function of
an evolutionary algorithm as an heuristic (distance providing) function is the subject
of the following sections.
Such a fitness function is exactly what is sought in an heuristic function for many
AI search algorithms. In these algorithms, the value attached to a vertex by the
evaluation function is often interpreted as a distance. For example, in A* [139],
search from a state n proceeds according to a function f(n) = g(n) + h(n) where
g(n) is a function estimating the minimum distance from the starting state to state
n and h(n) is an heuristic function estimating the minimum distance from n to the
goal state. There are many results that show that the better an estimate h(n) is of
the function h*(n), which gives the exact distance to the goal, the better an heuristic
search algorithm will perform. AI is typically concerned with admissible heuristic
functions, in which h(n) ≤ h*(n) for all n. However, the original descriptions of searching
labeled graphs suggested only that h(n) be correlated with h*(n) [142].
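For readers unfamiliar with A*, the following generic sketch (standard textbook A* rendered in Python, not code from the dissertation) shows the roles of g(n) and h(n); the toy graph, costs and heuristic in the example are assumptions for illustration only.

    import heapq

    def a_star(start, goal, neighbours, cost, h):
        """Generic A*: `neighbours(n)` yields successor states, `cost(n, m)` is an
        edge cost, and `h(n)` estimates the remaining distance to `goal`.
        Returns the list of states on a cheapest path found, or None."""
        g = {start: 0}                    # best known distance from the start state
        parent = {start: None}
        frontier = [(h(start), start)]    # priority queue ordered by f = g + h
        while frontier:
            f, n = heapq.heappop(frontier)
            if n == goal:
                path = []
                while n is not None:
                    path.append(n)
                    n = parent[n]
                return path[::-1]
            for m in neighbours(n):
                new_g = g[n] + cost(n, m)
                if m not in g or new_g < g[m]:
                    g[m] = new_g
                    parent[m] = n
                    heapq.heappush(frontier, (new_g + h(m), m))
        return None

    if __name__ == "__main__":
        # Toy example: walk on the integers towards 9 with unit edge costs,
        # using the exact distance |9 - n| as an (admissible) heuristic.
        print(a_star(0, 9,
                     neighbours=lambda n: [n - 1, n + 1],
                     cost=lambda n, m: 1,
                     h=lambda n: abs(9 - n)))  # [0, 1, 2, ..., 9]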
The type of fitness function we would most like in evolutionary algorithms is
exactly what is desirable as an heuristic function in AI search algorithms. If we
assume that the closer our fitness functions approximate the AI ideal the easier search
will be, and can quantify how well they do this, we have a measure of search
difficulty. The usefulness of the measure will provide an indication of how realistic the
original assumption was. The remainder of this chapter introduces a measure of this
correlation and investigates its usefulness as a predictor of GA performance.
5.6. GA Difficulty
The search for factors affecting the ability of the GA to solve optimization problems
has been a major focus within the theoretical GA community. Horn and Goldberg
[143] recently stated "If we are ever to understand how hard a problem GAs can
solve, how quickly, and with what reliability, we must get our hands around what
`hard' is." The most visible attempt at pinning down what it is that makes search
more or less difficult for GAs is work on the theory of deceptive functions, which
has been developed by Goldberg [104] and others, based upon early work by Bethke
[144]. However, researchers seem quite divided over the relevance of deception, and
we have seen reactions ranging from proposals that "the only challenging problems
are deceptive" [145, 146] to informal claims that deception is irrelevant to real-world
problems, and Grefenstette's demonstration that the presence of deception is not
necessary or sufficient to ensure that a problem is difficult for a GA [147]. In addition,
the approach has not been generalized to GAs that do not operate on binary strings;
it requires complete knowledge of the fitness function; quantifying deception can be
difficult computationally; the theory is rooted in the schema theorem, which has also
been the subject of much recent debate; and finally, non-deceptive factors such as
spurious correlations (or hitch-hiking) [96, 148] have been shown to adversely affect
the GA. Kargupta and Goldberg have recently considered how signal and noise
combine to affect search [149, 150]. They focus on how the dynamics of schema
processing during the run of a GA alters measures of signal and noise. This method
is promising, and provides plausible explanations for GA performance on a number
of problems, some of which are considered here.
Another attempt to capture what it is that makes for GA difficulty is centered
around the notion of "rugged fitness landscapes." At an informal level, it is commonly
held that the more rugged a fitness landscape is, the more difficult it is to search.
While this vague statement undoubtedly carries some truth, "ruggedness" is not
easily quantified, even when one has defined what a landscape is. Unfortunately, the
informal claim also breaks down. For example, Ackley [26] and Horn and Goldberg
[143] have constructed landscapes with a provably maximal number of local optima,
but the problems are readily solved by a GA. At the other extreme, a relatively
smooth landscape may be maximally difficult to search, as in "needle in a haystack"
problems. Thus, even before we can define what ruggedness might mean, it is clear
that our intuitive notion of ruggedness will not always be reliable as an indicator of
difficulty and we can expect that it will be extremely difficult to determine when the
measure is reliable.
The most successful measure of ruggedness developed to date has been the
calculation of "correlation length" by Weinberger [65], which was the basis for the work
of Manderick et al. [60]. Correlation length is based on the rate of decrease in correlation
between parent and offspring fitness, and clearly suffers from the above problem
with relatively flat landscapes: correlation length is large (indicating an easy search
problem) but the problem may be very difficult. Additionally, parent/offspring fitness
correlation can be very good even when the gradient of the landscape is leading away
from the global maximum, as in deceptive problems. Associated with ruggedness is
the notion of "epistatic interactions," which were the basis of a viewpoint
on GA difficulty proposed by Davidor [151]. Although it is clear that some highly
epistatic landscapes are difficult to search [117, 118], it is not clear how much
epistasis is needed to make a problem difficult, and Davidor's measure is not normalized,
making comparisons between problems difficult. Also, the method does not provide
confidence measures, is computationally "not economical," and will have the same
problems on landscapes with little or no epistasis (because they are relatively flat) as
described above.
As final testimony to the claim that we have not yet developed a reliable
indicator of GA hardness, there have been several surprises when problems did not prove
as easy for a GA as had been expected. Tanese [152] constructed a class of Walsh
polynomials of fixed order and found that a GA encountered difficulty even on
theoretically easy low-order polynomials. In an attempt to study GA performance in a
simple environment, Mitchell et al. constructed the "royal road" functions [36]. They
compared the GA's performance on two royal road functions, one of which contained
intermediate-size building blocks that were designed to lead the GA by the hand to
the global optimum. Surprisingly, the GA performed better on the simpler function
in which these intermediate building blocks did not exist.
All these notions of what makes a problem hard for a GA have something to
recommend them, but all seem to be only a piece of the whole story. It is clear that
we are still some way from a good intuition about what will make a problem hard
for a GA. I propose that it is the relationship between fitness and distance to the
goal that is important for GA search. This relationship is apparent in scatter plots
of fitness versus distance and is often well summarized by computing the correlation
coefficient between fitness and distance.
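As a minimal illustrative sketch of this computation (not the dissertation's code), one can sample points, record each point's fitness and Hamming distance to the nearest known global optimum, and compute the ordinary Pearson correlation coefficient between the two; the function names are mine.

    import random
    from math import sqrt

    def hamming(x, y):
        return sum(a != b for a, b in zip(x, y))

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        vx = sum((a - mx) ** 2 for a in xs)
        vy = sum((b - my) ** 2 for b in ys)
        return cov / sqrt(vx * vy)

    def fdc(fitness, optima, n_bits, samples=4000, rng=random):
        """Estimate r between fitness and Hamming distance to the nearest optimum."""
        fits, dists = [], []
        for _ in range(samples):
            x = tuple(rng.randint(0, 1) for _ in range(n_bits))
            fits.append(fitness(x))
            dists.append(min(hamming(x, o) for o in optima))
        return pearson(fits, dists)

    if __name__ == "__main__":
        # One max on 8 bits: the single optimum is the all-ones string, and the
        # estimate should be very close to the exact value r = -1.
        n = 8
        print(fdc(fitness=lambda x: sum(x), optima=[tuple([1] * n)], n_bits=n))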
the object of the search (usually a global maximum). It is likely that a better statistic
would be obtained if distances were computed using the operator that defined the
edges of the landscape graph, though these will be more difficult to compute.
Hamming distance is a simple first approximation to distance under the actual operators
of a GA. That Hamming distance works well as a predictor of GA performance is
perhaps a result of its close relationship to distance under mutation. These issues are
discussed in §5.7.2 (page 158).
I will use r (FDC) as a measure of problem difficulty. Correlation works best as
a summary statistic of the relationship between two random variables if the variables
follow a bivariate normal distribution. There is no guarantee that this will be the case
if we have a random sample of fitnesses, and there are therefore situations in which r
will be a poor summary statistic of the relationship between fitness and distance. I am
not claiming that correlation is necessarily a good way to summarize the relationship
between fitness and distance. I do claim that this relationship is what is important.
In practice, examining the scatter plot of fitness versus distance is very informative
in the cases where there is structure in this relationship that cannot be detected
by correlation. It is important to realize that correlation is only one of the possible
ways that the relationship between fitness and distance can be examined. It appears
quite useful, although we will see examples of problems for which it is too simplistic.
In all of these cases, the scatter plots are useful for revealing the shortcomings of
correlation.
Figure 53. Summary of results. Horizontal position is merely for grouping, vertical
position indicates the value of r. Abbreviation explanations and problem sources are
given in Table 32.
in this figure, together with a short description of the problems and their sources, can
be found in Table 32. Figures 54 and 104 show examples of scatter plots of fitness and
distance from which r is computed. The plots represent all the points in the space
unless a number of samples is mentioned. In these scatter plots, a small amount of
noise has been added to distances (and in some cases fitnesses) so that identical
fitness/distance pairs can be easily identified. This was suggested by Lane [153] and in
many cases makes it far easier to see the relationship between fitness and distance.
This noise was not used in the calculation of r; it is for display purposes only.
5.7.1.1. Confirmation of Known Results
This section investigates the predictions made by FDC on a number of problems that
have been relatively well-studied. These include various deceptive problems, and
other simply defined problems.
Easy Problems
Ackley's "one max" problem [26] was described in §3.8.1 (page 58). According to the FDC
measure, this problem is as simple as a problem could be. It exhibits perfect negative
correlation (r = -1), as is shown in Figure 54. The one max fitness function is
essentially the ideal fitness function described above. The distance to the single
global optimum is perfectly correlated with fitness. Ackley's "two max" problem is
also correctly classified as easy by FDC (r = -0.41). This function has two peaks,
both with large basins. For binary strings of length n, the function is defined as

    f(x) = |18u(x) - 8n|.
For K < 3, the NK landscape problems (described in §4.7.1 (page 97)) produce high
negative correlation (-0.83, -0.55 and -0.35), though r moves rapidly towards zero
as K increases, which qualitatively matches the increases in search difficulty found
by Kauffman [117, 118] and others. As NK landscapes are constructed from a table
of N × 2^(K+1) random numbers, the r value for each K value is the mean of ten different
landscapes. Figures 56 to 58 show three NK landscapes for N = 12 and K = 1, 3
and 11. When K = 11 the landscape is completely random, and this is reflected by
an r value that is very close to zero.
One, two and three instances of Deb and Goldberg's [105] 6-bit fully easy problem
(described in §3.8.2 (page 59)) are shown in Figures 59 to 61. Interestingly, each of these
problems has r = -0.2325, which is an indication that in some sense the problem
difficulty is not affected by changing the number of copies of the same subfunction.
Table 32. The problems of Figure 53. Where a problem has two sources, the first
denotes the original statement of the problem and the second contains the description
that was implemented.
Abbreviation Problem Description Source
BBk Busy Beaver problem with k states. [106, 110]
Deb & Goldberg 6-bit fully deceptive and easy functions. [105]
Fk(j) De Jong's function k with j bits. [30, 5]
GFk(j) As above, though Gray coded. [30, 5]
Goldberg, Korb & Deb 3-bit fully deceptive. [154, 145]
Grefenstette easy The deceptive but easy function. [147]
Grefenstette hard The non-deceptive but hard function. [147]
Holland royal road Holland's 240-bit royal road function. [112, 155]
Horn, Goldberg & Deb The long path problem with 40 bits. [86]
Horn & Goldberg A 33-bit maximally rugged function. [143]
Liepins & Vose (k) Deceptive problem with k bits. [47, 156]
Mix(n) Ackley's mix function on n bits. [26]
NIAH Needle in a haystack. p. 148
NK(n, k) Kauffman's NK landscape. N = n, K = k. [117]
One Max Ackley's single-peaked function. [26]
Plateau(n) Ackley's plateau function on n bits. [26]
Porcupine(n) Ackley's porcupine function on n bits. [26]
R(n, b) Mitchell et al. n-bit royal road, b-bit blocks. [36, 96]
Tanese (l, n, o) l-bit Tanese function of n terms, order o. [152, 157]
Trap(n) Ackley's trap function on n bits. [26]
Two Max(n) Ackley's two-peaked function on n bits. [26]
Whitley Fk 4-bit fully deceptive function k. [145]
Figure 54. Ackley's one max problem on 8 bits (r = -1). Ackley's fitnesses were
actually ten times the number of ones, which has no effect on r.
Figure 55. Ackley's two max problem on 9 bits (r = -0.41).
Although an algorithm may require more resources to solve a problem with more
subfunctions, the problem difficulty, in the eyes of FDC, does not increase. This is
intuitively appealing. It is very hard to dissociate problem difficulty from algorithm
resources, yet the FDC measure can be interpreted as "recognizing" that solving the
same problem twice (perhaps simultaneously) is no harder than solving it once. Of
course it will require more resources, but it can be argued that it is not more difficult.
It is possible to prove that FDC remains the same when any number of copies of a
function are concatenated in this manner. For this reason, we will see this behavior
in all the problems below that involve multiple identical subproblems. The proof of
this invariance is given in Appendix D.
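The intuition behind this invariance can be sketched as follows (this is only an outline, under the assumption that the k copies are independent and that distance, like fitness, decomposes as a sum over the copies; the full argument is the one in Appendix D):

    % Fitness and distance of the concatenated problem are sums over the k copies:
    %   F = f_1 + ... + f_k,   D = d_1 + ... + d_k,
    % with the (f_i, d_i) independent and identically distributed. Then
    \mathrm{Cov}(F, D) = k\,\mathrm{Cov}(f, d), \qquad
    \mathrm{Var}(F) = k\,\mathrm{Var}(f), \qquad
    \mathrm{Var}(D) = k\,\mathrm{Var}(d),
    % so the factors of k cancel in
    r = \frac{\mathrm{Cov}(F, D)}{\sqrt{\mathrm{Var}(F)\,\mathrm{Var}(D)}}
      = \frac{\mathrm{Cov}(f, d)}{\sqrt{\mathrm{Var}(f)\,\mathrm{Var}(d)}}.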
Figure 62 shows Grefenstette's deceptive but easy problem [147] (r = -0.33). In
this problem, two variables x1 and x2 are encoded using 10 bits each. The problem
is to maximize

    f(x1, x2) = x1^2 + 10 x2^2           if x2 < 0.995,
                2(1 - x1)^2 + 10 x2^2    otherwise.

While this problem is highly deceptive (at least under some definitions of deception),
it is simple for a GA to optimize. FDC correctly classifies the problem as simple.
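A minimal sketch of this function follows (illustrative only; the decoding shown, which maps each 10-bit block onto [0, 1], is an assumption about the encoding rather than a detail given here).

    def decode(bits):
        """Map a tuple of bits onto a real value in [0, 1]."""
        return int("".join(map(str, bits)), 2) / (2 ** len(bits) - 1)

    def grefenstette_easy(bits):
        """Grefenstette's deceptive-but-easy function; `bits` holds 2 * 10 bits."""
        x1, x2 = decode(bits[:10]), decode(bits[10:])
        if x2 < 0.995:
            return x1 ** 2 + 10 * x2 ** 2
        return 2 * (1 - x1) ** 2 + 10 * x2 ** 2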
Ackley's maximally rugged "porcupine" function is shown in Figure 64. In this
Figure 59. Deb & Goldberg's fully easy 6-bit problem (r = -0.23).
Figure 60. Two copies of Deb & Goldberg's fully easy 6-bit problem (r = -0.23).
Figure 61. Three of Deb & Goldberg's fully easy 6-bit problems (r = -0.23, 4000
sampled points).
function, every binary string with even parity is a local maximum under the single
bit-flip operator. For a binary string of even length n, the function is defined as

    f(x) = 10u(x) - 15   if u(x) is odd,
           10u(x)        otherwise.

Despite the ruggedness, the function is not difficult to optimize (unless your algorithm
never goes downhill and only employs an operator that changes a single bit at a time).
FDC is very strongly negative (r = -0.88).
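The following short sketch (illustrative only) transcribes the porcupine function just defined, together with the two max function given earlier, writing u(x) for the number of ones as elsewhere in the dissertation.

    def u(x):
        """Unitation: the number of ones in the bit string x."""
        return sum(x)

    def two_max(x):
        """Ackley's two max function on binary strings of length n."""
        return abs(18 * u(x) - 8 * len(x))

    def porcupine(x):
        """Ackley's porcupine function; x must have even length."""
        return 10 * u(x) - 15 if u(x) % 2 == 1 else 10 * u(x)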
Horn and Goldberg's maximally rugged function, shown in Figure 65, is very
similar and also exhibits strong negative correlation (r = -0.83). Given a binary
Figure 62. Grefenstette's deceptive but easy 20-bit problem (r = -0.32, 4000 sampled
points).
Figure 63. Grefenstette's non-deceptive but hard 10-bit problem. The single point
with fitness 2048 is omitted from the plot. When included, r = -0.09; when excluded,
r = 0.53.
Figure 64. Ackley's porcupine problem on 8 bits (r = -0.88).
Figure 65. Horn & Goldberg's maximum modality problem on 9 bits (r = -0.83).
Deceptive Problems
An early fully deceptive problem is that given by Goldberg, Korb and Deb [154]. It
is defined over three bits as follows:
f(000) = 28    f(100) = 14
f(001) = 26    f(101) = 0
f(010) = 22    f(110) = 0
f(011) = 0     f(111) = 30
Figures 67 to 69 show two to four concatenated copies of this function, for which
r = 0.32. Deb and Goldberg's 6-bit fully deceptive problem [105] (described in
§3.8.3 (page 60)) has r = 0.30. Figures 70 to 72 show one to three copies of this function.
As with the fully easy subproblems described above, the concatenation of several
deceptive subproblems does not affect r.
Ackley's "trap" function is defined on binary strings of length n as follows:

    f(x) = (8n/z)(z - u(x))           if u(x) <= z,
           (10n/(n - z))(u(x) - z)    otherwise,
Figure 66. Ackley's mix problem on 20 bits (r = -0.44, 4000 sampled points).
where z = ⌊3n/4⌋. This function has two peaks, the higher of which has a small
basin of attraction. Deb and Goldberg showed that the function is not fully deceptive
for n < 8 [158]. However, the problem becomes increasingly difficult (from the point of
view of a hillclimber using the single bit-flip operator) as n increases, as the basin of
attraction (under that operator) of the global maximum (the string with n ones) only
includes those points with u(x) > z, a vanishingly small fraction of the entire space
[41, page 22]. FDC becomes increasingly strongly positive as n is increased. For n = 9,
10, 12 and 20, the r values obtained are 0.56, 0.71, 0.88 and 0.98 respectively. The
last of these is from a sample of 4,000 points and the scatter plot is shown in Figure 73.
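A direct, illustrative transcription of the trap function as defined above:

    def trap(x):
        """Ackley's trap function on a binary string x of length n."""
        n = len(x)
        z = (3 * n) // 4
        ones = sum(x)                       # u(x)
        if ones <= z:
            return (8 * n / z) * (z - ones)
        return (10 * n / (n - z)) * (ones - z)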
In "145], Whitley discusses three fully deceptive functions. The rst of these is
the 3-bit problem due to Goldberg, Korb and Deb "154] just discussed. The second
and third functions (which will be referred to as Whitley's F2 and F3) both had four
Figure 67. Two copies of Goldberg, Deb & Korb's fully deceptive 3-bit problem
(r = 0.32).
Figure 68. Three copies of Goldberg, Deb & Korb's fully deceptive 3-bit problem
(r = 0.32).
Figure 69. Four copies of Goldberg, Deb & Korb's fully deceptive 3-bit problem
(r = 0.32).
Figure 70. Deb & Goldberg's fully deceptive 6-bit problem (r = 0.30).
Figure 71. Two copies of Deb & Goldberg's fully deceptive 6-bit problem (r = 0.30).
Figure 72. Three copies of Deb & Goldberg's fully deceptive 6-bit problem (r = 0.30,
4000 sampled points). Notice the additive effect.
bits. The function values for the 16 four-bit strings for F2 are as follows:
f(0000) = 28   f(0100) = 22   f(1000) = 20   f(1100) = 8
f(0001) = 26   f(0101) = 16   f(1001) = 12   f(1101) = 4
f(0010) = 24   f(0110) = 14   f(1010) = 10   f(1110) = 6
f(0011) = 18   f(0111) = 0    f(1011) = 2    f(1111) = 30
For one, two and three copies of Whitley's F2, r = 0.51. Scatter plots of these are
shown in Figures 74 to 76.
Figure 73. Ackley's trap function on 20 bits (r = 0.98, 4000 sampled points).
Figure 74. Whitley's F2. A fully deceptive 4-bit problem (r = 0.51).
Figure 75. Two copies of Whitley's F2 (r = 0.51).
Figure 76. Three copies of Whitley's F2 (r = 0.51).
Figure 77. Whitley's F3. A fully deceptive 4-bit problem (r = 0.37).
Figure 78. Two copies of Whitley's F3 (r = 0.37).
Figure 79. Three copies of Whitley's F3 (r = 0.37).
Figure 80. Holland's royal road on 32 bits (b = 8, k = 2 and g = 0) (r = 0.25, 4000
sampled points).
Figure 81. Holland's royal road on 128 bits (b = 8, k = 4 and g = 0) (r = 0.27, 4000
sampled points).
Figure 82. Horn, Goldberg & Deb's long path problem with 11 bits (r = -0.12).
Notice the path.
points on the path, it is strongly negative, as it is for the points not on the path (e.g.,
for strings with 12 bits, we get r = -0.39 and r = -0.67) but combining the samples
gives a much lower correlation (r = -0.19 for 12 bits). This is a first illustration
of how correlation may sometimes prove too simplistic a summary statistic of the
relationship between fitness and distance. Fortunately, the striking structure of the
problem is immediately apparent from the scatter plot, as can be seen in Figure 82.
Zero Correlation
The needle in a haystack (NIAH) problem is defined on binary strings of length n as
follows:

    f(x) = 1   if u(x) = n,
           0   otherwise.

The function is everywhere zero except for a single point. When we compute FDC
exhaustively (or in a sample which includes the needle), r is very close to zero. This
illustrates how FDC produces the correct indication when a function is very flat.
In such cases, measures such as correlation length will indicate that the problem is
simple whereas it is actually maximally hard. If the sample used to compute FDC
does not include the needle, the correlation is undefined as there is no variance in
fitness. In this case it can be concluded that the problem is difficult (i.e., similar to
r = 0) or some amount of noise can be added to fitness values, which will also result
in r ≈ 0. A similar NIAH problem is the following:

    f(x) = 100              if u(x) = n,
           uniform[0..1]    otherwise,

where uniform[0..1] is a function that returns a real value chosen uniformly at random
from the interval [0..1]. For this NIAH problem, r will be approximately zero whether
the needle is sampled or not.
The 2-, 3-, and 4-state busy beaver problems (described in §3.8.5 (page 60)) (Figures 83
to 85) and the NK(12,11) landscape (Figure 58 on page 140), all known to be difficult
problems, also had r approximately 0.
Figure 83. 2-state busy beaver problem (r = -0.06, 4000 sampled points).
Figure 84. 3-state busy beaver problem (r = -0.08, 4000 sampled points).
Figure 85. 4-state busy beaver problem (r = -0.11, 4000 sampled points).
Some of De Jong's functions have very low r values. For example, F1(15) (Figure 86)
has r = -0.01. Though the correlation measures correctly predict that F1(15)
will be harder for the GA than GF1(15), the low correlation for F1(15) is misleading.
Looking at the scatter plot in Figure 86 it is clear that the function contains many
highly fit points at all distances from the global optimum. From this, it is reasonable
to expect that a GA will have no trouble locating a very high fitness point.
Figure 86. De Jong's F1 binary coded with 15 bits converted to a maximization
problem (r = -0.01, 4000 sampled points).
Figure 87. De Jong's F1 Gray coded with 15 bits (r = -0.30, 4000 sampled points).
Tanese Functions
Tanese found that on Walsh polynomials on 32 bits with 32 terms each of order 8
(which we will denote via T(32,32,8)) a standard GA found it very difficult to locate
a global optimum [152]. FDC gives an r value very close to 0 for all the instances of
T(32,32,8) considered, as it does for T(16,16,4) functions. When the number of terms
is reduced, the problem becomes far easier; for instance, T(16,8,4) functions typically
have an r value of approximately -0.37. This contrast is illustrated by Figures 88
and 89. These results are consistent with the experiment of Forrest and Mitchell [157],
who found that increasing string length made the problem far simpler in T(128,32,8)
functions. It is not practical to use FDC on a T(128,32,8) function since it is possible
to show that they have at least 2^96 global optima and the FDC algorithm requires
computing the distance to the nearest optimum.
Figure 88. Tanese function on 16 bits with 8 terms of order 4 (r = -0.37, 4000
sampled points).
Figure 89. Tanese function on 32 bits with 32 terms of order 8 (r = 0.03, 4000
sampled points).
Figure 90. Royal road function R1 with 8 blocks of length 4 (r = -0.52, 4000 sampled
points).
Figure 91. Royal road function R2 with 8 blocks of length 4 (r = -0.50, 4000 sampled
points).
Experiments with a GA have indicated that these predictions are accurate. For
example, consider the positions in Figure 53 of F2(n) and GF2(n). F2 is a problem
on two real variables, so F2(n) indicates that n/2 bits were used to code for each
variable. When n = 8, we calculated r(F2(8)) = -0.24 whereas r(GF2(8)) = -0.06,
indicating that with 8 bits, binary coding is likely to make search easier than Gray
coding.¹ Figures 93 and 94 show a clear difference in the encoding on 8 bits. But now
consider r(F2(12)) = -0.10 versus r(GF2(12)) = -0.41. With 12 bits, Gray coding
should be better than binary. The scatter plots for F2(12) and GF2(12) are shown in
Figures 95 and 96. When we move to 16 bits, we get r(F2(16)) = r(GF2(16)) = -0.19
(Figures 97 and 98). Finally, with 24 bits (the number used by Caruana and Schaffer),
we have r(F2(24)) = -0.09 and r(GF2(24)) = -0.26 (Figures 99 and 100). Once
again, Gray coding should be better (as was found by Caruana and Schaffer when
considering online performance).
¹ As the De Jong functions are minimization problems, positive r values are ideal. We have
inverted the sign of r for these functions, to be consistent with the rest of the dissertation. It is
straightforward to prove that this is the correlation that will obtain if we convert the problem to a
maximization problem via subtracting all original fitnesses from a constant.
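For reference, the standard binary-reflected Gray coding used in comparisons of this kind can be sketched as follows (generic code, not taken from the dissertation; that De Jong's functions were encoded in exactly this way is an assumption):

    def binary_to_gray(bits):
        """Binary-reflected Gray code of a tuple of bits (most significant first)."""
        return tuple(b ^ prev for b, prev in zip(bits, (0,) + bits[:-1]))

    def gray_to_binary(gray):
        """Inverse transform: recover the plain binary string from its Gray code."""
        out = []
        acc = 0
        for g in gray:
            acc ^= g
            out.append(acc)
        return tuple(out)

    # Example: 3 is 011 in binary and 010 in Gray code; adjacent integers always
    # differ by a single bit in the Gray coding.
    assert binary_to_gray((0, 1, 1)) == (0, 1, 0)
    assert gray_to_binary((0, 1, 0)) == (0, 1, 1)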
Figure 93. De Jong's F2 binary coded with 8 bits converted to a maximization
problem (r = -0.24).
Figure 94. De Jong's F2 Gray coded with 8 bits converted to a maximization problem
(r = -0.06).
Figure 95. De Jong's F2 binary coded with 12 bits converted to a maximization
problem (r = -0.10). The unusual appearance is due to the presence of many highly
fit points and "cliffs" in the encoding.
Figure 96. De Jong's F2 Gray coded with 12 bits converted to a maximization
problem (r = -0.41).
Figure 97. De Jong's F2 binary coded with 16 bits converted to a maximization
problem (r = -0.18).
Figure 98. De Jong's F2 Gray coded with 16 bits converted to a maximization
problem (r = -0.21).
Figure 99. De Jong's F2 binary coded with 24 bits converted to a maximization
problem (r = -0.09).
Figure 100. De Jong's F2 Gray coded with 24 bits converted to a maximization
problem (r = -0.26).
Figure 101. De Jong's F3 binary coded with 15 bits converted to a maximization
problem (r = -0.86, 4000 sampled points).
Figure 102. De Jong's F3 Gray coded with 15 bits (r = -0.57, 4000 sampled points).
Figure 103. De Jong's F5 binary coded with 12 bits converted to a maximization
problem (r = -0.86).
Figure 104. De Jong's F5 Gray coded with 12 bits (r = -0.57).
5.7.2. Discussion
There is an intuitively appealing informal argument that the correlation between
fitness and distance is what is important for success in search. Suppose you get out
of bed in the night, hoping to make your way through the dark house to the chocolate
in the fridge in the kitchen. The degree to which you will be successful will depend
on how accurately your idea of where you are corresponds to where you actually are.
If you believe you're in the hallway leading to the kitchen but are actually in the
bedroom closet, the search is unlikely to end happily. This scenario is also the basis
of an argument against the claim that good parent/offspring fitness correlation is
what is important for successful search. Determining whether the floor is more or
less flat will not help you find the kitchen. If r has a large magnitude, we conjecture
that parent/child fitness correlation will also be high. This is based on the simple
observation that if FDC is high, then good correlation of fitnesses between neighbors
should be a consequence. Thus good parent/child fitness correlation is seen as a
necessary but not sufficient condition for a landscape to be easily searchable. If this
conjecture is correct, such correlation is not sufficient as it will also exist when FDC
gives a value that is large and positive. This will likely be unimportant in real world
problems, if problems with large positive FDC are purely artificial constructions of
Figure 105. Liepins and Vose's fully deceptive problem on 10 bits (r = 0.98).
Figure 106. The transform of Liepins and Vose's fully deceptive problem on 10 bits
(r = -0.02). Correlation cannot detect the X structure.
the GA community.
It is far from clear what it means for a problem to be difficult or easy. As a
result it is inherently difficult to test a measure of problem difficulty. A convincing
demonstration will need to account for variability in the resources that are used to
attack a problem, the size of the problem, variance in stochastic algorithms, and
other thorny issues. FDC will only give an indication of how hard it is to locate what
you tell it you are interested in locating. If you only tell it about global maxima,
it is unreasonable to expect information about whether a search algorithm will find
other points or regions. If all the global optima are not known and FDC is run on a
subset of them, its results may indicate that the correlation is zero. When the other
optima are added, the correlation may be far from zero. FDC is useful in saying
something about problems whose solutions are already known. It can be hoped that
information on small examples of problems will be applicable to larger instances,
but in general this will not be the case. FDC is intended to be a general indicator
of search difficulty, and is not specific to any search algorithm. In particular, FDC
knows nothing whatsoever about a GA. This is both encouraging and alarming. It
is encouraging that the measure works well on a large number of problems, but
alarming since, if we are to use it as a serious measure of problem hardness for a
GA, it should know something about the GA. Probably the best interpretation of
FDC values is as an indication of approximately how difficult a problem should be.
For example, if r = -0.5 for some problem, but a GA never solves it, there is an
indication that the particular GA is doing something wrong.
Hamming distance is not a distance metric that applies to any of a GA's operators.
Distance between strings s1 and s2 under normal GA mutation is more akin to
the reciprocal of the probability that s1 will be converted to s2 in a single application
of the mutation operator. Naturally, Hamming distance is strongly related to this
distance and this is presumably one reason why FDC's indications correlate well with
GA performance. A more accurate measure might be developed that was based on
the distances between points according to the operator in use by the algorithm. That
Hamming distance works at all as an indicator of GA performance is a strong hint that
a simplistic (i.e., easily computed) distance metric on permutations (for example, the
minimum number of remove-and-reinsert operations between two permutations) may
also prove very useful as a metric in FDC when considering ordering problems, even
if the algorithms in question do not make use of that operator. A similar approach,
also successful, was adopted by Boese, Kahng and Muddu [116] (see §6.4 (page 169)
for some details).
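As a hedged illustration of such a simple permutation metric (this particular formulation, computing the minimum number of remove-and-reinsert moves via a longest common subsequence, is my reading of the suggestion above, not a construction from the dissertation):

    def lcs_length(p, q):
        """Length of the longest common subsequence of two sequences (O(n*m) DP)."""
        m = [[0] * (len(q) + 1) for _ in range(len(p) + 1)]
        for i, a in enumerate(p, 1):
            for j, b in enumerate(q, 1):
                m[i][j] = m[i - 1][j - 1] + 1 if a == b else max(m[i - 1][j], m[i][j - 1])
        return m[len(p)][len(q)]

    def reinsert_distance(p, q):
        """Minimum number of remove-and-reinsert moves turning permutation p into q:
        every element outside a longest common subsequence must be moved once."""
        return len(p) - lcs_length(p, q)

    # Example: one move suffices (take the 3 out and reinsert it at the front).
    assert reinsert_distance((1, 2, 3, 4), (3, 1, 2, 4)) == 1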
Because the computation of FDC relies on prior knowledge of global optima,
the measure does not appear well suited for prediction of problem difficulty. Ongoing
research is investigating an approach to prediction, based on the relationship between
fitness and distance. In this approach, an apparently high peak is located (via some
number of hillclimbs) and FDC is computed as though this peak were the global
optimum.² This gives an indication of how hard it is to find that peak. This process
is repeated several times to get an overall indication of the difficulty of searching
on the given landscape. It is expected that such a predictive measure could be
"fooled" in the same way that measures of correlation length can be fooled. That
is, it will classify a deceptive problem as easy because, if the single global optimum
is ignored, the problem is easy. In practice, it is probably more desirable that a
measure of problem difficulty reports that such problems are easy. Only a very strict
and theoretical orientation insists that these problems be considered difficult.
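A rough sketch of this predictive variant follows (my own illustration of the procedure described above, not code from the ongoing research; the `hillclimb` and `fdc` helpers passed in are hypothetical, for instance an any ascent climber and the FDC estimator sketched earlier):

    def predictive_fdc(fitness, n_bits, hillclimb, fdc, trials=10):
        """Estimate search difficulty without knowing the true optima: repeatedly
        hillclimb to an apparently high vertex, compute FDC as though that vertex
        were the global optimum, and average the resulting r values."""
        rs = []
        for _ in range(trials):
            apparent_peak = hillclimb(fitness, n_bits)       # any hillclimber will do;
            rs.append(fdc(fitness, [apparent_peak], n_bits))  # the vertex need not even be a peak
        return sum(rs) / len(rs)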
Discussion in this section has concentrated mainly on fitness and distance. It
should be noted that FDC is affected not only by the fitness function, but by the
operator (which gives distance), by the representation and by the mapping from objects
in the real world to the representation used by the search algorithm. Because these will
all affect FDC, it follows that FDC should be useful in comparing all of these choices.
² It is not even necessary for the vertex from which distance is computed to be a peak. A randomly
chosen vertex will still allow the calculation of FDC.
5.7.3. Conclusion
This chapter presented a perspective from which evolutionary algorithms may be
regarded as heuristic state space search algorithms. Much of the change in perspective
can be accomplished by a simple change in language that sets aside the biological
metaphor usually employed when describing evolutionary algorithms. One aspect
of the correspondence between the fields was examined in detail: the relationship
between fitness and heuristic functions.
The relationship between fitness and distance to goal has a great influence on
search difficulty for a GA. One simple measure of this relationship is the correlation
coefficient between fitness and distance (FDC), which has proved a reliable, though
not infallible, indicator of GA performance on a wide range of problems. On occasion,
correlation is too simplistic a summary statistic, in which case a scatter plot of fitness
versus distance will often reveal the structure of the relationship between fitness and
distance. FDC can be used to compare different approaches to solving a problem.
For instance, FDC predicted that the relative superiority of binary and Gray coding
for a GA was dependent on the number of bits used to encode variables. Subsequent
empirical tests have supported this.
The development of FDC proceeded directly from thinking in terms of a model
of search which views the GA as navigating on a set of landscape graphs. AI has
long regarded search from a similar perspective, and a simple change in language
is sufficient to view GAs as state-space search algorithms using heuristic evaluation
functions. In AI, the heuristic function is explicitly chosen to be as well correlated
with distance to the goal state as possible, and it is easy to argue that a similar
fitness function in a GA will make for easy search. Having done that, it is a small
step to consider to what extent our current GA landscapes match this ideal, and to
use that as an indicator of search difficulty. The fact that this proves successful is
likely to be unsurprising to those in the AI community who work on heuristic search
algorithms. I believe there is much that can be learned about GAs via considering
their relationship with heuristic state-space search.
CHAPTER 6
Related Work
6.1. Introduction
This chapter divides work related to this dissertation into three areas: (1) work on
landscapes in other fields, (2) work on landscapes in computer science and
evolutionary computation, and (3) work related to the connection between landscapes and
heuristic search, especially to fitness distance correlation. Several pieces of research
discussed below could be placed in more than one of these categories. This is merely
one way of roughly grouping related work.
These conflicting interpretations of fitness landscapes suggest that the metaphor
was an attractive one to many. The lack of a rigorous definition did not prevent use
of the landscape metaphor from becoming widespread. Actually, lack of definition
may have encouraged this. There is nothing inherently wrong with this situation if
landscapes are used to convey a vague picture of some process. When attempts are
made to use landscapes in a rigorous fashion, it becomes more important to know
what is being referred to. A similar situation exists in evolutionary computation,
where the word landscape tends to be used in a rather cavalier fashion. Landscapes
are frequently heard of but infrequently defined. In some instances this is very useful.
In other cases, landscapes are used as the foundation of seemingly plausible arguments
that, when closely examined, can be extremely difficult to make any sense of at all.
These models are only relevant to systems in which an operator acts on a single
"individual" to produce another "individual." That is, one RNA sequence is
converted to another, or one Hamiltonian circuit in a graph is converted into
another.
The systems under consideration all involve a single operator. For example,
an operator which changes a spin in a spin glass or a point mutation operator
which changes a nucleotide in an RNA molecule.
All possible outcomes from an application of an operator have equal probability
of occurrence.
These differences make the application of these landscape models to evolutionary
algorithms problematic. In evolutionary algorithms, none of the above are true.
These algorithms have operators that act on and produce multiple individuals, they
employ multiple operators, and the possible results of operator application are not
equally probable. These possibilities are incorporated in the landscapes model of this
dissertation.
autocorrelation function can be calculated for these simplest of operators. They
recognized crossover as an operator without making the generalization of the model of
this dissertation. To deal with this, they defined a measure of operator correlation
which they applied successfully to one-point crossover on NK landscapes and to four
crossover operators for TSP. This statistic is calculated by repeated use of the
operator from randomly chosen starting points. Thus, they were actually computing
statistical measures about the crossover landscape without recognizing it as a
landscape. Had they done so, they might have used the autocorrelation function, since
they dealt with the walkable landscapes generated by crossover operators that take
two parents and produce two children.
Taking the work of Manderick et al. as a starting point, Mathias and Whitley
[61] examined other crossover operators for TSP, and Dzubera and Whitley developed
measures of "part" and "partial" correlation which they also apply to TSP. Much
of the work mentioned above in theoretical chemistry has also been influenced by
Weinberger's work (see, for example, [62, 166, 167, 168]). An overview of uses of
the landscape perspective is far beyond the scope of this section. These works are
mentioned to show the extent of the influence of Weinberger's model and definition
of the autocorrelation function and correlation length.
The work of Nix and Vose and others on Markov chain analysis of GAs [52, 169,
170, 171, 172, 115] can also be looked at from a landscape perspective. As mentioned
in §2.3 (page 23), the choice of what one regards as an operator is a matter of perspective.
Different choices will impact the ease with which we can study different aspects of
search algorithms. By viewing an entire generation of a GA as an operator that
converts one population to another, the GA can be imagined as taking a walk on a
landscape graph whose vertices correspond to populations. This operator, like any
other, can also be viewed as defining a transition matrix for a Markov chain. Naturally,
for a process as complex as an entire GA generation, calculating the transition
probabilities will be quite involved, but this is what Nix and Vose have done. Because
this operator is used repeatedly (unlike simpler operators such as crossover and
mutation which are used in series), the expected behavior of a GA can be determined via
Markovian analysis. Due to the exponential growth in the number of possible states
of this Markov process, such analysis is typically restricted to small populations
consisting of short individuals. Nevertheless, it offers exact results and insights that are
otherwise not available. Markov chain-like analysis of individual operators has been
carried out by Goldberg and Segrest [173], Mahfoud [174], Horn [175] and Whitley
and Yoo [176].
Recently, Culberson [29] independently conceived of a crossover landscape. His
structure, which he calls a search space structure, is also a graph and the vertices
correspond to a population of points from {0, 1}^n. He examines crossovers between
two complementary strings, which creates a graph that corresponds to the largest
connected component of the crossover landscape generated, in the current model, by
one-point crossover taking two parents and producing two offspring. That component
(like all others) is a hypercube. It contains vertices that correspond to all possible
pairs of binary strings of the form (a, ā). Culberson shows that the structure of the
component is isomorphic to the hypercube generated by the bit-flipping operator for
strings of length n - 1. He shows the existence of local minima of this crossover
operator in this structure and demonstrates how to transform a problem that appears
hard for one operator into a problem that appears hard for the other. This provides
further evidence of the importance of structure for search and of how that structure
is induced by the choice of operator. Wagner is also investigating isomorphisms
between crossover and mutation landscapes [177].
There is other related work whose connections to the landscape model have
not yet been well explored. This includes work done by Whitley and by Vose on
mixing matrices [178, 179], work by Altenberg [180], whose transmission functions for
crossover operators that take two parents and produce one offspring can be drawn as
landscape graphs similar to that shown in Figure 11 on page 54, and Helman's work
on a general algebra for search problems [140].
1. Kauffman ([117, page 564] and [118, page 61]) shows scatter plots of the fitness
of local minima versus their distance from the best local minimum found,
for several NK landscapes. A number of local minima are discovered through
hillclimbing, and each is represented by a single point in the plot. These
diagrams demonstrate the existence of what Kauffman calls a "Massif Central" in
NK landscapes with small K. By this it is meant that the highest peaks in the
landscape are near each other and this region occupies a very small region of
the entire search space. These scatter plots differ from the plots used in FDC
only in the fact that each point in the plots corresponds to a peak on the
landscape, not to a randomly chosen vertex in the landscape. Kauffman does not
compute correlation or other statistics about the relationship between distance
from the (apparent) global optimum and fitness of local optima.
2. Boese, Kahng and Muddu [116] produce similar scatter plots, but on a variety of different problems, and use them to motivate the construction of "multi-start" hillclimbers that take advantage of the "big valley" structure they observe. Their "big valley" is Kauffman's "Massif Central" inverted (Boese et al. study minimization problems). They observe that the highest peaks (found by various hillclimbers based on different heuristics) are tightly clustered. They use this knowledge to construct an "adaptive multi-start" hillclimber that takes uphill steps in the same way the BH algorithm of §3.6(55) does. The hillclimber uses the 2-opt operator introduced by Lin and Kernighan [182]. The algorithm assumes that the problem at hand exhibits the big valley structure, and they show how the hillclimber compares very favorably with other hillclimbers that do not take advantage of this global structure. The hillclimber uses information about high fitness peaks located earlier in the search to select new starting vertices on subsequent climbs. This technique is applied to 100- and 500-city randomly generated symmetric traveling salesperson problems and to 100- and 150-vertex graph bisection problems. Boese [183] extends this work to consider the 532-city traveling salesperson problem of Padberg and Rinaldi [184] and mentions that similar plots have been obtained on circuit partitioning, graph partitioning, number partitioning, satisfiability, and job shop scheduling.
In addition to scatter plots showing fitness of local optima against distance to global optimum, Boese et al. produced plots of fitness of local optima versus mean distance to other local optima. On the problems they have studied, there has (apparently) been a single global optimum, rather than many, as with some problems studied with FDC (which uses distance to the nearest global optimum). Boese [183] produced similar plots on the 532-city problem using local optima located via several hillclimbers, each using a different "heuristic" ("operator" in the language of this dissertation). The results for these different hillclimbers allow Boese to compute the size of the "big valley" for each operator. The correlation between fitness of local optima and distance to global optimum was also computed and used to examine the relative strengths of the relationship for each of the operators.
Boese et al. use a distance metric that is not the same as the distance defined by the operators they employ. There is no known polynomial time algorithm for computing the minimum number of 2-opt moves between two circuits in a graph. Instead, they use "bond" distance, which is the number of vertices in the graph minus the number of edges the two circuits have in common. Some analysis shows that this distance is always within a factor of two of the 2-opt distance between the circuits. This is similar to the use of Hamming distance instead of "mutation distance" or "crossover distance" in the calculation of FDC.
Because the scatter plots of Boese et al. examine local optima, the relationship that will be observed between fitness and distance will depend on the difficulty of the problem and on the quality of the local optima that are examined. If a good hillclimber is used to locate local optima on an easy problem, few local optima may be found. In the worst case, on a non-pathological (in the style of Horn et al. [86]) unimodal problem, only one peak will be located, and it will be the global optimum. Boese et al. and Hagen and Kahng [185] have observed instances where the relationship between fitness and distance deteriorates, and this appears dependent not only on the problem at hand, but on the quality of the heuristic used to locate local optima. FDC does not run into problems in these situations because it considers randomly chosen vertices of the landscape; there is no requirement that they be local optima. Thus FDC is useful for classifying easy problems as well as difficult ones. This is of little consequence to Boese et al., who are motivated to take advantage of the big valley structure in problems that contain many local optima. Naturally, the pervasiveness of the "big valley" structure across problem types and instances will determine how useful the adaptive multi-start algorithm is. Judging from these initial investigations, the class of problems possessing this structural property may be quite large. If so, algorithms designed to exploit this structure, such as adaptive multi-start, will be hard to beat.
3. Ruml, Ngo, Marks and Shieber have also produced a scatter plot of fitness against distance from a known optimum [186]. They examined an instance of the number partitioning problem (of size 100) and compared points in several representation spaces to the optimum obtained by Karmarkar and Karp [187] (KK-optimum). Ruml et al. generate points by taking directed walks away from the global optimum, at each step increasing the distance from the KK-optimum by one. The points found on several such walks are then plotted as fitness (normalized by the fitness of the KK-optimum) versus distance from the KK-optimum. The resulting plots exhibit a "bumpy funnel" in which there is a strong correspondence between solution quality and distance from the KK-optimum. As with the observations of Boese et al. [116, 183], the high quality solutions are tightly clustered around the best-known optimum, and Ruml et al. suggest that search algorithms that take advantage of this type of structure (which they term "gravitation") will prove competitive. It is clear from these results that the problem structure around the KK-optimum is qualitatively like that around the optima studied by Boese et al. Unlike Boese's results with the 532-city TSP problem, we cannot be so confident that the KK-optimum is the global optimum, as Karmarkar and Karp's algorithm exhibits a strong bias towards equal-sized partitions. The relationship between fitness and distance is not quantified in any way; the scatter plots are used to give pictorial evidence of the structure of the problem and of "gravitation."
4. As mentioned in §1.3(8), Korf presented several heuristics for the 2×2×2 version of Rubik's cube [33]. Korf's graphs have heuristic value on the X axis and the mean distance of all cube configurations with a given heuristic value on the Y axis. At first glance, these graphs would seem to have little to do with the FDC scatter plots of Chapter 5, but this is not the case. If the axes are swapped and each point is replaced by the actual set of points that it represents, the result is a scatter plot. As Korf points out, there will be virtually no correlation between fitness and distance in any of these plots. This replacement and inversion of axes has not been done, but Korf's diagrams make it apparent that if the correlation is not zero, it can only be slightly negative (this is undesirable in Korf's heuristic functions, which must be maximized).
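To make the scatter plots discussed in items 1 and 2 concrete, the sketch below (Python, on a made-up rugged bit-string fitness; the functions and parameters are illustrative assumptions, not taken from any of the works above) gathers local optima by repeated hillclimbing and pairs the fitness of each with its Hamming distance to the best optimum located.

import random

random.seed(1)
n = 20
# toy rugged fitness: random weights on adjacent bit pairs (a stand-in for a
# real problem such as an NK landscape); W[i][2*a + b] scores the bits (a, b)
W = [[random.random() for _ in range(4)] for _ in range(n - 1)]

def fitness(x):
    return sum(W[i][2 * x[i] + x[i + 1]] for i in range(n - 1))

def hillclimb():
    """Steepest-ascent hillclimb under single-bit flips; returns a local optimum."""
    x = [random.randint(0, 1) for _ in range(n)]
    while True:
        best_i, best_f = None, fitness(x)
        for i in range(n):
            x[i] ^= 1                      # examine the neighbor with bit i flipped
            if fitness(x) > best_f:
                best_i, best_f = i, fitness(x)
            x[i] ^= 1                      # restore
        if best_i is None:
            return tuple(x)                # no uphill neighbor: a local optimum
        x[best_i] ^= 1                     # move to the best uphill neighbor

def hamming(a, b):
    return sum(p != q for p, q in zip(a, b))

optima = {hillclimb() for _ in range(500)}             # distinct local optima
best = max(optima, key=fitness)                        # best optimum located
for d, f in sorted((hamming(x, best), fitness(x)) for x in optima):
    print(d, f)                                        # (distance, fitness) points to plot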
Finally, there is a relationship between FDC and Weinberger's correlation length [65]. Correlation length is a measure of correlation between fitness values, but the autocorrelation function, from which correlation length is calculated, also takes distance into account. The autocorrelation function provides the correlation between fitnesses at all distances. The connection between this measure and FDC is the subject of current investigation. In particular, if FDC proves useful for prediction using the method suggested in §5.7.2(160), it will likely be prone to "error" in the way that correlation length is when the fitness landscape is not isotropic. When FDC is computed from the point of view of a randomly chosen vertex (or peak), it, like correlation length, will indicate that fully deceptive problems (for example) are easy. In practice there may be nothing wrong with this, since if we ever encounter a reasonably large fully deceptive problem, it would be a simple problem, as far as we would ever be able to tell. The global maximum may as well not exist. A measure of problem difficulty that accords well with what we actually experience is likely to be far more useful than some measure which insists that such a problem is actually very hard.
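For reference, Weinberger's statistics are usually estimated from the fitness series of a random walk. The following minimal sketch (Python; the toy fitness function and the estimator details are assumptions) computes the autocorrelation at lag one and the corresponding correlation length under one common definition.

import math
import random

random.seed(2)
n = 32
weights = [random.random() for _ in range(n)]

def fitness(x):
    # toy additive fitness over bit strings; any landscape could be substituted
    return sum(w for w, b in zip(weights, x) if b)

def random_walk_fitnesses(steps):
    """Fitness series along a random walk that flips one random bit per step."""
    x = [random.randint(0, 1) for _ in range(n)]
    series = [fitness(x)]
    for _ in range(steps):
        x[random.randrange(n)] ^= 1
        series.append(fitness(x))
    return series

def autocorrelation(series, lag):
    """Rough estimate of the autocorrelation of the fitness series at a given lag."""
    m = sum(series) / len(series)
    var = sum((f - m) ** 2 for f in series) / len(series)
    cov = sum((series[t] - m) * (series[t + lag] - m)
              for t in range(len(series) - lag)) / (len(series) - lag)
    return cov / var

f = random_walk_fitnesses(20000)
r1 = autocorrelation(f, 1)
correlation_length = -1.0 / math.log(r1)   # one common definition
print(r1, correlation_length)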
CHAPTER 7
Conclusions
In the introduction to this work, I wrote that the dissertation was a collection of very
simple questions and my attempts to answer them. It seems appropriate then that
a concluding chapter should present these questions and summarize (1) the answers
I have found, (2) the increase in understanding that the answers bring, and (3) the
practical benefits that result from the understanding.
What is a landscape?
This is the most fundamental question of this dissertation. In the model of this dissertation, a landscape is a labeled, directed graph. The vertices of the graph correspond to multisets of individuals and the edges correspond to the actions of an operator. If an operator acts on an individual A, producing individual B with probability p, then the landscape graph contains an edge from the vertex representing A to the vertex representing B, and that edge is labeled with p. It is important to allow a vertex to correspond to a multiset of individuals, not just a single one. By so doing, landscape graphs are well defined for operators that act on and produce multiple individuals. The vertices of a landscape graph are labeled with a value that can be thought of as a height, and this gives rise to the landscape image. When vertices correspond to single individuals, they are typically labeled with values from a fitness function. If not, the label is some function of the fitnesses of the individuals represented by the vertex. This, in essence, is the landscape model of this dissertation.
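A landscape graph in this sense is easy to write down directly. The sketch below (Python, with invented names; a minimal illustration rather than code from this dissertation) represents a vertex as a multiset of individuals and an operator as a function returning probability-labeled edges, here for a simple per-bit mutation operator with only the zero- and one-bit-flip outcomes enumerated.

from collections import Counter

# A vertex is a multiset of individuals; a frozenset of (individual, count)
# pairs makes it hashable.  Individuals here are bit strings (tuples of 0/1).

def vertex(*individuals):
    return frozenset(Counter(individuals).items())

def mutation_edges(v, p_bit=0.1):
    """Edges out of a single-individual vertex under per-bit mutation.

    Returns a dict mapping destination vertices to probabilities; multi-bit
    outcomes are omitted for brevity."""
    ((x, _),) = v                           # assumes the vertex holds one individual
    n = len(x)
    edges = {vertex(x): (1 - p_bit) ** n}   # probability that no bit flips
    one_flip = p_bit * (1 - p_bit) ** (n - 1)
    for i in range(n):
        y = x[:i] + (1 - x[i],) + x[i + 1:]
        edges[vertex(y)] = edges.get(vertex(y), 0.0) + one_flip
    return edges

v = vertex((0, 1, 1, 0))
for w, p in mutation_edges(v).items():
    print(sorted(w), round(p, 4))           # each probability-labeled edge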
When evolutionary algorithms are viewed from the perspective of this model, there is an important consequence. Because these algorithms are usually viewed as making use of several operators, they are not operating on a single landscape, but on many. As a result, there are mutation landscapes, crossover landscapes, selection landscapes: every operator creates a landscape graph. Every time an algorithm uses a particular operator, it can be seen as traversing an edge, or taking a "step," on the landscape defined by the operator in question. I call this the "One Operator, One Landscape" consequence of the current model. If we intend to study and talk about landscapes, my model insists that we pay attention to the "one operator, one landscape" consequence. Naturally, we can study the results of using several operators in concert, but we can ask simpler questions: questions about the individual landscapes.
There are other consequences of the model, and these are described in §2.7(36). Two of these are the fact that landscapes may not be connected, and that they may not be walkable.
The model of landscapes presented in this dissertation was designed to be useful for thinking about evolutionary algorithms. Other models, designed for different purposes, have limitations that make them inappropriate for this purpose. The difficulties are that these other models do not (1) allow for algorithms that employ multiple operators, (2) account for operator transition probabilities, or (3) admit operators that act on or produce multiple individuals.
The landscape view, which might at first seem a specialized way of thinking about evolutionary algorithms and landscapes, is nothing more than the perspective on search that has long been held in the Artificial Intelligence community. For example, production systems have been described as consisting of a "database," "production rules," and a "control strategy." These correspond fairly well to our landscape view of search, with its vertices, edges (operators), and navigation strategy. Establishing links with the Artificial Intelligence search community is a good reality check, and, as will be seen, these links are more than superficial.
The landscape perspective has also proved practical. The success of the crossover hillclimber leads to another simple question.
Is Crossover Useful?
This question has no simple answer because it is too general. It is tautological to answer that crossover is useful if it manages to orchestrate the exchange of building blocks between individuals. The question we should be asking is: "Given a particular set of choices about algorithm, representation, fitness function, etc., does the use of crossover produce benefits that other, simpler, operators cannot produce?" If so, there is a good argument for the use of crossover in that situation. A more complex subsequent question could ask for the characteristics that these situations have in common. In other words, as Eshelman and Schaffer asked, what is crossover's niche? [89].
From a cursory inspection of the crossover hillclimbing algorithm, it appears that the algorithm has isolated crossover from a genetic algorithm and harnessed its power by searching using a more exploitative navigation strategy. New individuals are either created completely at random or via the use of crossover. As it is easily shown that the randomly created individuals are of predictably low quality, the credit for the success of the algorithm must go to crossover. This reasoning is correct, but it is far from being the whole story.
Closer examination of the crossover hillclimbing algorithm reveals that its performance correlates extremely well with the number of crossovers it performs in which one individual is randomly created (§3.10(72)). But crossovers involving one random individual are nothing more than macromutations of the non-random individual, where the mutations are distributed in the fashion of the crossover operator in question. The conclusion that crossover hillclimbing was successful because it was making macromutations was confirmed when an extreme version of the algorithm, in which every crossover involved one random individual, proved better than all previous versions of the algorithm. This leads to a distinction between the idea and the mechanics of crossover. The idea is what crossover is trying to achieve (the recombination of above average genetic material from parents of above average fitness) and the mechanics is the method by which this is attempted. All crossover operators share the same basic idea, but their mechanics vary considerably. The crossover hillclimbing algorithm is a clear demonstration that crossover can be used effectively for search even when it is only the mechanics that are doing the work.
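The mechanics-only use of crossover described above is easy to demonstrate. In the sketch below (Python; the operator choice, helper names and navigation strategy are assumptions for illustration, not the dissertation's crossover hillclimbing implementation), crossing the current solution with a freshly generated random individual is precisely a macromutation whose shape is dictated by the crossover operator.

import random

def two_point_crossover(parent_a, parent_b):
    """Return one offspring from two-point crossover of two bit strings."""
    i, j = sorted(random.sample(range(len(parent_a) + 1), 2))
    return parent_a[:i] + parent_b[i:j] + parent_a[j:]

def macromutate(x):
    """Crossover with a random individual: a macromutation of x whose
    mutated positions are distributed as the crossover operator dictates."""
    random_individual = [random.randint(0, 1) for _ in range(len(x))]
    return two_point_crossover(x, random_individual)

def hillclimb_with_macromutation(fitness, n, trials=5000):
    """Accept a macromutated candidate whenever it is at least as fit."""
    x = [random.randint(0, 1) for _ in range(n)]
    for _ in range(trials):
        y = macromutate(x)
        if fitness(y) >= fitness(x):
            x = y
    return x

# usage with a trivial fitness function (count of ones)
best = hillclimb_with_macromutation(sum, 50)
print(sum(best))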
Ideally, fitness should increase as distance to the goal of the search decreases; the more closely a problem and landscape approximate this ideal, the easier search will be on the landscape. A measure of this, Fitness Distance Correlation (FDC), is defined in Chapter 5.
The application of FDC to approximately 20 problems from the literature on genetic algorithms shows it to be a very reliable measure of difficulty for the genetic algorithm. Correlation is only one summary of the relationship between fitness and distance from the goal of the search, and in some cases it is too simplistic. Scatter plots of sampled vertices on a landscape often reveal structure not detected by correlation. These plots are one way of looking for structure in a landscape graph and they are often surprisingly revealing. Apart from correctly classifying many problems, including ones that gave surprising results when first studied, FDC also predicted that the question of whether or not Gray coding was beneficial to a genetic algorithm was dependent on the number of bits used in the encoding. The accuracy of this prediction was later confirmed experimentally.
FDC captures a property of a landscape graph. It has nothing to do with any search algorithm beyond the choices that determine the landscape. As a result, FDC can indicate that a problem should be easy, but an algorithm, because of its navigation strategy, may find the problem hard. The indications of FDC about a problem can be interpreted as a rough measure of how hard a problem should be. If FDC says a problem is easy but an algorithm does not find it so, this may be taken as a sign that this choice of algorithm (or algorithm parameters) is unusually bad for this problem. As an indicator of genetic algorithm performance, FDC has proved very reliable.
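Since FDC is nothing more than the correlation between sampled fitnesses and distances to the nearest global optimum, it can be computed in a few lines. The sketch below (Python; the sampling scheme and the one-max example are assumptions for illustration) shows the calculation for a bit-string problem whose global optimum is known.

import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def fdc(fitness, optima, n, samples=2000):
    """Correlation between sampled fitnesses and Hamming distance to the
    nearest known global optimum."""
    fs, ds = [], []
    for _ in range(samples):
        x = [random.randint(0, 1) for _ in range(n)]
        fs.append(fitness(x))
        ds.append(min(hamming(x, opt) for opt in optima))
    return pearson(fs, ds)

# toy one-max problem: the optimum is the all-ones string; for a maximization
# problem an FDC near -1 says fitness rises as the distance to the goal falls
n = 30
print(fdc(sum, [[1] * n], n))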
APPENDIX A
exploration of the landscape. A breadth-first exploration based on distance from the original vertex faces the problem that a vertex v1 at distance d1 might not appear to be in the basin of attraction of the original vertex until the depth-first search reaches some vertex v2 at distance d2 > d1 and discovers that vertex v1 has some probability of ascending to v2. Such an algorithm would be faced with storage problems similar to those of the first implementation.
The breadth-first approach can be used, however, if the search proceeds breadth-first according to fitness rather than breadth-first according to distance from the initial vertex. If all the fitter neighbors of a vertex v1 have been completely processed by the algorithm, and have all updated their downhill neighbors that could re-ascend, then the vertex v1 can never again be encountered. Once the downhill neighbors of the vertex v1 are updated according to their probabilities of reaching v1, the statistics for v1 can be written to a file (if they are even to be recorded) and the memory allocated to v1 can be freed. This idea has several advantages:
Vertices in the basin will be encountered multiple times, but will only be descended from once. In the previous implementations, each time a vertex was reached, its basin of attraction would be calculated afresh. This provides a significant increase in speed.
This implementation returns memory to the operating system as it does not need to keep the entire basin of attraction in memory. There is a simple condition that indicates when a vertex has been completely dealt with, and at that time its summary statistics can be gathered and the vertex discarded from memory. The peak memory load is typically 15 to 20 percent less than that required by the first implementation.
The vertices in the basin of attraction are found in order of decreasing fitness. This allows the efficient delineation of some fraction of the upper part of the basin of attraction. The earlier implementations also provide a method of approximation, as recursive calls can be made only if the probability of ascent (or the fitness) has not fallen below some lower limit.
This method provides a large speed increase and a moderate saving in memory. Its main drawback is the memory requirement, but it has been used to find basins containing over half a million vertices.
The data structures used to implement this version are naturally more involved. The two most important operations are the fast lookup of a vertex based on its identity and a fast method for retrieving the vertex with the highest fitness from those that still need to be processed. The first task is best accomplished using a hash table and the second using a priority queue (implemented as a heap). I implemented a combination hash table and priority queue to solve both problems at once. The elements of the hash table contain a pointer into the priority queue and vice versa. This allows O(1) expected vertex identity comparisons to retrieve a vertex given its identity and O(1) worst-case retrieval of the unprocessed vertex with the highest fitness (the priority queue retrieves a pointer into the hash table of the highest fitness vertex in constant time, and this is dereferenced in constant time since it points directly to the stored element and does not require hashing). Insertion of a new vertex requires insertion into both the hash table and the priority queue and this requires O(lg n) comparisons (worst and average case), where n is the number of elements currently in the priority queue. Deletion requires O(lg n) comparisons (worst and average case) for similar reasons.
A somewhat simplified pseudo-code summary of this implementation is shown in Algorithm A.1. The pseudo-code assumes that the original point has already been inserted into both the hash table and the priority queue. Details of the coordination of pointers from the hash table to the priority queue and vice versa are not shown. For more efficiency, the hash table find and insert can be done at the same time if the vertex is not present.
The above solution should not be taken as being in any way optimal. It is simply better than two others that are less sophisticated. A more efficient solution might employ a better mix of memory requirement and post-processing. If the hash table were allowed to contain only some maximum number of vertices, incomplete vertices (probably those that were judged most complete, for instance those with the highest ascent probabilities) could be written to a file for later coalescing, as in the second implementation. If a good balance could be found (and there is no reason to think this would not be straightforward, e.g., do as much in memory as you can afford), this would cap memory requirements and hopefully not demand too much disk space or post-processing time.
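A minimal sketch of the fitness-ordered exploration described in this appendix is given below (Python, on a toy additive landscape, with invented names; probability bookkeeping and the hash-table/priority-queue pointer coordination are omitted, so this is an illustration of the idea rather than Algorithm A.1 or the original implementation). A set plays the role of the hash table, tracking membership only, and a heap that of the priority queue.

import heapq
import random

random.seed(3)
n = 12
weights = [random.random() for _ in range(n)]

def fitness(x):
    # toy fitness over bit strings (tuples of 0/1)
    return sum(w for w, b in zip(weights, x) if b)

def neighbors(x):
    for i in range(n):
        yield x[:i] + (1 - x[i],) + x[i + 1:]

def basin(peak):
    """Vertices that can reach `peak` by strictly uphill single-bit steps,
    processed in order of decreasing fitness (breadth-first by fitness)."""
    table = {peak}                               # hash-table role: membership
    heap = [(-fitness(peak), peak)]              # priority-queue role
    order = []
    while heap:
        negf, v = heapq.heappop(heap)
        order.append((-negf, v))                 # v is now completely dealt with
        for w in neighbors(v):
            # w joins the basin if it lies strictly downhill of v, so an
            # uphill walker at w could ascend to v
            if fitness(w) < fitness(v) and w not in table:
                table.add(w)
                heapq.heappush(heap, (-fitness(w), w))
    return order

peak = tuple(1 for _ in range(n))                # all-ones is the peak here
print(len(basin(peak)))                          # number of basin vertices found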
APPENDIX B
Balanced Hillclimbing
On all six landscapes examined in Chapter 4, SA had the highest probability of locating a peak on a single hillclimb. AA was always second. Despite this, on five of the six problems, AA is the algorithm of choice since it expects to perform fewer evaluations per uphill step than SA does, and this advantage outweighs SA's greater location probability. It was always clear that, all other things being equal, making fewer evaluations per step rather than more would result in a better algorithm. Reverse hillclimbing shows that, at least on the instances of the problems considered in Chapter 4, all other things being equal, being more exploitative rather than less will also produce a better algorithm. Since SA and AA represent two extremes when it comes to these choices in hillclimber design, it is reasonable to expect that a more balanced algorithm might perform better than either of them.
This line of reasoning leads to the idea of hillclimbers that spend some amount of time identifying uphill neighbors and then move to the highest of those found. If we restrict attention to those hillclimbers that do not restart unless they have examined all neighbors and found no improvement, these algorithms can be thought of as members of a family of hillclimbers that make use of a function called another. The another function returns a boolean value which is used to decide whether another uphill neighbor should be sought. The arguments to the function are the number of neighbors (n), the number of uphill directions located so far (u), and the number of neighbors that have been tried (t). The assumption that these hillclimbers will always continue to search for an uphill neighbor until the first is found is equivalent to saying that another(n, 0, t) = true for all algorithms in this family. Similarly, we can assume that the another function will not be called if t = n, i.e., if all neighbors have been tried.
Both AA and SA are members of this family (LA and MA are not, since they do not, in general, move to the steepest of the uphill neighbors). In AA, the another function always returns false, never continuing the search for uphill neighbors after the first has been located. In SA, the another function always returns true, since SA requires the examination of all neighbors. Within this family of hillclimbers, it is easy to see the sense in which AA and SA are the extreme algorithms.
There are several obvious alternatives for the another function. A very simple one is a function that returns true only if u < k for some k (a constant or a function of n). This might be called Best-of-k hillclimbing, since the resulting algorithm finds k uphill neighbors (if possible) and moves to the best of these. I will denote this algorithm by BO-k. When k = 1, this algorithm becomes AA and when k = n it becomes SA. AA will, on average, choose the median of all the uphill neighbors for a certain amount of work. On average, BO-2 will do twice as much work but will only choose an uphill neighbor whose rank is 1/3rd amongst all uphill neighbors. A rough estimate argues that twice as much work finding uphill neighbors will not be adequately compensated for by only receiving a one-sixth increase in the ranking of the uphill neighbor chosen.
Another strategy attempts to find another uphill neighbor only if fewer than k neighbors have been examined, for some k (a constant or a function of n). That is, the another function returns true if t < k. This algorithm also degenerates to AA and SA if k = 1 or k = n. When k = 2 the algorithm is very similar to AA, and can be thought of as identical to AA except that it seeks a second neighbor only on those occasions when AA found an uphill neighbor on the first trial. If k is small compared to n, this algorithm can be thought of as trying for a second neighbor when AA gets lucky. This algorithm will be called Try-k hillclimbing.
Table 33. The another function for several hillclimbers. n is the number of neighbors,
u is the number of uphill directions located so far, and t is the number of neighbors
tried so far.
A third possibility is to keep looking only while it still seems likely that there are more uphill neighbors, and otherwise return false. This hillclimber attempts to find many uphill neighbors, but stops looking once it appears likely that no more exist. It is similar to SA, but should use fewer evaluations, at the risk of missing uphill directions. For example, if only a single uphill direction has been found and more than half the neighbors have been examined, the another function of this algorithm would return false. If two uphill neighbors have been found and more than two-thirds of the neighbors have been examined, it returns false, etc. The another function returns true if n − t ≥ t/u. This will be called predictive ascent hillclimbing, and abbreviated PA. A similar algorithm that is more pessimistic, denoted Pess, returns true if n − t ≥ 2t/u. The another functions for AA, SA, and the above algorithms are summarized in Table 33.
A graphical interpretation of these functions is shown in Figure 107. Any par-
titioning of such graphs into true and false regions represents a hillclimbing algo-
rithm. These graphs also illustrate what is meant by the claim that AA and SA are
in some sense algorithms at the two extremes of another functions.
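The family is easy to express in code. The sketch below (Python; the climbing routine, the toy problem and the names are illustrative assumptions) drives a single hillclimber with an another predicate and writes the another functions of Table 33 as one-liners.

import random

def climb(x, fitness, neighbors, another):
    """One hillclimb: examine neighbors in random order, asking another(n, u, t)
    whether to keep looking, then move to the best uphill neighbor found.
    Stops at a vertex with no uphill neighbor."""
    while True:
        nbrs = list(neighbors(x))
        random.shuffle(nbrs)
        n, fx = len(nbrs), fitness(x)
        uphill = []
        for t, y in enumerate(nbrs, start=1):
            if fitness(y) > fx:
                uphill.append(y)
            u = len(uphill)
            if uphill and t < n and not another(n, u, t):
                break
        if not uphill:
            return x                        # local optimum reached
        x = max(uphill, key=fitness)        # move to the best uphill neighbor found

# the another functions of Table 33, written as predicates
AA = lambda n, u, t: False                  # stop after the first uphill neighbor
SA = lambda n, u, t: True                   # examine every neighbor
BO_k = lambda k: (lambda n, u, t: u < k)    # Best-of-k
TRY_k = lambda k: (lambda n, u, t: t < k)   # Try-k
PA = lambda n, u, t: n - t >= t / u         # predictive ascent
PESS = lambda n, u, t: n - t >= 2 * t / u   # pessimistic ascent

# usage on a toy one-max problem over bit strings
def bit_neighbors(x):
    return [x[:i] + (1 - x[i],) + x[i + 1:] for i in range(len(x))]

start = tuple(random.randint(0, 1) for _ in range(20))
print(sum(climb(start, sum, bit_neighbors, BO_k(3))))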
Tables 34 to 39 show the result of hillclimbing with AA, SA, BO-2, PA, Pess,
Table 34. The results from 10,000 hillclimbs on a 16,12 NK landscape. The table
shows, for eight hillclimbers, how frequently the best 5% of 1,000 randomly located
peaks were located, and the mean and standard deviation of the number of evaluations
to do so.
Try-2, Try-4 and Try-n/2 on the NK landscapes and busy beaver problems studied in Chapter 4. A † sign is used to indicate occasions when one of the six new hillclimbers has a mean number of evaluations per peak that is less than both AA and SA.
The main conclusion that can be drawn from these tables is that AA is difficult to beat. Apart from one instance in which Pess beats AA (the 2-state busy beaver), the only hillclimbers to occasionally beat AA are the Try-k hillclimbers. There is a correspondence between the difficulty of the problem and how Try-2, Try-4 and Try-n/2 compare to AA. As problems get simpler, the performance of Try-k improves as k increases. As problems get harder, the expected number of uphill directions can be
[Figure 107 appears here: six small plots, one per another function (the visible panel titles are Best of K, Try K, Predictive Ascent and Pessimistic Ascent), each dividing the plane of neighbors examined versus uphill neighbors found into a true region and a false region.]
Figure 107. A graphical interpretation of six another functions. The X axis displays
the number of neighbors examined and the Y axis the number of uphill neighbors
found. The true and false regions indicate what the another function will return.
All axes have a minimum of one and a maximum of n (the number of neighbors).
expected to fall, and so an algorithm that spends less time looking for things that probably do not exist will tend to perform better. It should be remembered that AA is actually Try-1. Try-2 is very similar to AA: it only looks for another uphill direction if one was found on the very first neighbor examined. As the number of uphill neighbors increases, spending more time looking for them may be worthwhile. For this reason, the Try-k hillclimbers tend to do better on easy problems when k is larger.
Table 35. The results from 10,000 hillclimbs on a 16,8 NK landscape. The table
shows, for eight hillclimbers, how frequently the best 5% of 700 randomly located
peaks were located, and the mean and standard deviation of the number of evaluations
to do so.
Table 36. The results from 10,000 hillclimbs on a 16,4 NK landscape. The table
shows, for eight hillclimbers, how frequently the best 5% of the 180 peaks were located,
and the mean and standard deviation of the number of evaluations to do so.
Table 37. The results from 40 million hillclimbs on the 4-state busy beaver problem.
The table shows, for eight hillclimbers, how frequently an optimal Turing machine
was located, and the mean and standard deviation of the number of evaluations to
do so (rounded to the nearest million).
Table 38. The results from 100,000 hillclimbs on the 3-state busy beaver problem.
The table shows, for eight hillclimbers, how frequently an optimal Turing machine
was located, and the mean and standard deviation of the number of evaluations to
do so.
AA SA BO-2 PA Pess Try-2 Try-4 Try-n/2
Peaks found 397 444 424 435 411 430 433 438
Evals/peak 12,670 17,732 16,607 14,527 14,013 11,761† 11,903† 13,694
Evals/peak s.d. 13,048 16,977 16,165 13,967 14,090 10,494 10,942 13,507
Table 39. The results from 100,000 hillclimbs on the 2-state busy beaver problem.
The table shows, for eight hillclimbers, how frequently an optimal Turing machine
was located, and the mean and standard deviation of the number of evaluations to
do so.
APPENDIX C
Table 40. The standard errors for CH, GA-S, GA-E and BH on the one max problem
with 60 bits for the data graphed in Figure 13 on page 67.
33 36 39 42 45 48 51 54 57 60
CH 016 032 053 085 134 208 338 563 104 340
GA-S 005 019 106 749 129 139 160 192 246 446
GA-E 004 017 098 651 108 121 140 166 206 353
BH 010 012 014 015 017 020 024 031 043 245
Table 41. The standard errors for CH, GA-S, GA-E and BH on the one max problem
with 120 bits for the data graphed in Figure 14 on page 67.
66 72 78 84 90 96 102 108 114 120
CH 037 074 136 245 441 754 128 228 452 1014
GA-S 013 130 145 208 246 297 389 581 901
GA-E 008 077 880 126 146 180 230 326 824
BH 022 027 030 033 036 043 050 064 093 762
Table 42. The standard errors for CH, GA-S, GA-E and BH on the fully easy problem
with 10 subproblems for the data graphed in Figure 15 on page 69.
1 2 3 4 5 6 7 8 9 10
CH 065 379 846 138 202 297 460 927 301 2137
GA-S 006 102 164 486 850 202 753 5099 92126 10E7
GA-E 006 101 160 429 689 118 335 1435 14537 43E5
BH 102 324 127 800 614 9213
Table 43. The standard errors for CH, GA-S, GA-E and BH on the fully easy problem
with 15 subproblems for the data graphed in Figure 16 on page 69.
6 7 8 9 10 11 12 13 14 15
CH 275 354 462 738 152 401 1494 9610 14E5 50E6
GA-S 117 320 1451 11743 23E5 15E7
GA-E 121 303 1028 5838 53218 10E6
BH 373 2897 36718
Table 44. The standard errors for CH, GA-S, GA-E and BH on the fully deceptive
problem with 10 subproblems for the data graphed in Figure 17 on page 70.
1 2 3 4 5 6 7 8 9 10
CH 085 371 158 470 101 227 774 4710 69912 39E6
GA-S 006 100 105 441 201 761 4164 47627 11E6
GA-E 006 100 909 365 152 488 2160 14596 22E5
BH 202 491 20146
Table 45. The standard errors for CH, GA-S, GA-E and BH on the fully deceptive
problem with 15 subproblems for the data graphed in Figure 18 on page 70.
3 4 5 6 7 8 9 10 11 12
CH 510 165 418 870 173 384 1222 5585 46369 49E5
GA-S 563 135 328 102 264 718 2826 19927 43E5
GA-E 559 126 325 815 199 488 1438 6489 47247 66E5
BH 6213
Table 46. The standard errors for CH, GA-S, GA-E and BH on the distributed fully
deceptive problem with 10 subproblems for the data graphed in Figure 19 on page 71.
1 2 3 4 5 6
CH 030 100 296 7479 38E5
GA-S 004 077 182 17368 31E6
GA-E 006 101 194 7705 28E5 72E6
BH 213 510 23629
Table 47. The standard errors for CH, GA-S, GA-E and BH on the distributed fully
deceptive problem with 15 subproblems for the data graphed in Figure 20 on page 71.
1 2 3 4 5 6 7
CH 015 192 652 682 6552 12E5 22E6
GA-S 004 044 203 1208 76983 61E6
GA-E 004 044 181 665 10004 13E5 37E6
BH 180 248 4167
Table 48. The standard errors for CH, GA-S, GA-E and BH on the busy beaver
problem with 3 states for the data graphed in Figure 21 on page 72.
0 1 2 3 4 5 6
CH 005 010 038 124 365 104 154
GA-S 001 002 015 177 130 487 787
GA-E 001 003 016 186 124 397 616
BH 016 018 071 117 151 121 126
Table 49. The standard errors for CH, GA-S, GA-E and BH on the busy beaver
problem with 4 states for the data graphed in Figure 22 on page 72.
4 5 6 7 8 9 10 11 12 13
CH 193 496 112 349 131 917 12930 14E5 13E6 59E6
GA-S 531 195 500 257 2201 43098 33E5 60E6 11E7
GA-E 659 224 503 167 951 16545 21E5 24E6 81E6
BH 103 158 466 168 784 1003 25372 24E5 14E6
Table 50. The standard errors for CH, GA-S, GA-E and BH on Holland's royal road
problem with k = 4 for the data graphed in Figure 23 on page 73.
1 2 3 4
CH 323 222 403 18E5
GA-S 015 375 2070
GA-E 015 343 1321 31E6
BH 2345
Table 51. The standard errors for CH, GA-S, GA-E and BH on Holland's royal road
problem with k = 6 for the data graphed in Figure 24 on page 73.
1 2 3
CH 086 338 38849
GA-S 003 290 13786
GA-E 003 315 12262
BH 359 74E5
Table 52. The standard errors for CH, CH-1S and CH-NJ on the one max problem
with 120 bits for the data graphed in Figure 25 on page 75.
66 72 78 84 90 96 102 108 114 120
CH 037 074 136 245 441 754 128 228 452 1014
CH-1S 031 080 181 352 619 104 194 345 691 627
CH-NJ 157 873 257
Table 53. The standard errors for CH, CH-1S and CH-NJ on the fully easy problem
with 15 subproblems for the data graphed in Figure 26 on page 75.
6 7 8 9 10 11 12 13 14 15
CH 275 354 462 738 152 401 1494 9610 14E5 50E6
CH-1S 284 367 480 659 975 164 334 879 3385 35004
CH-NJ
Table 54. The standard errors for CH, CH-1S and CH-NJ on the fully deceptive
problem with 15 subproblems for the data graphed in Figure 27 on page 75.
5 6 7 8 9 10 11 12 13 14
CH 418 870 173 384 1222 5585 46369 49E5
CH-1S 421 863 161 312 708 2090 7764 41828 38E5 53E6
CH-NJ
Table 55. The standard errors for CH, CH-1S and CH-NJ on the distributed fully
deceptive problem with 15 subproblems for the data graphed in Figure 28 on page 75.
1 2 3 4 5 6 7
CH 015 192 652 682 6552 12E5 22E6
CH-1S 011 147 605 757 7573 97328 18E6
CH-NJ 024 251 508 1068
Table 56. The standard errors for CH, CH-1S and CH-NJ on the busy beaver problem
with 4 states for the data graphed in Figure 29 on page 76.
4 5 6 7 8 9 10 11 12 13
CH 193 496 112 349 131 917 12930 14E5 13E6 59E6
CH-1S 192 536 117 364 136 964 15697 12E5 56E5 17E6
CH-NJ 782 7718 10E5
Table 57. The standard errors for CH, CH-1S and CH-NJ on Holland's royal road
problem with k = 4 for the data graphed in Figure 30 on page 76.
1 2 3 4
CH 323 222 403 18E5
CH-1S 193 211 258 14619
CH-NJ 635 14063
Table 58. The standard errors for GA-S and GA-RC on the one max problem with
120 bits for the data graphed in Figure 32 on page 79.
72 77 82 87 92 97 102 107 112 117
GA-S 130 115 202 232 262 305 389 530 108 46E5
GA-RC 091 102 550 2641 56E6
Table 59. The standard errors for GA-S and GA-RC on the fully easy problem with
15 subproblems for the data graphed in Figure 33 on page 79.
2 3 4 5 6 7 8 9 10 11
GA-S 043 811 318 628 117 320 1451 11743 23E5 15E7
GA-RC 042 856 860 2762 36E5
Table 60. The standard errors for GA-S and GA-RC on the fully deceptive problem
with 15 subproblems for the data graphed in Figure 34 on page 79.
2 3 4 5 6 7 8 9 10 11
GA-S 044 563 135 328 102 264 718 2826 19927 43E5
GA-RC 043 674 386 616 52192
Table 61. The standard errors for GA-S and GA-RC on the distributed fully de-
ceptive problem with 15 subproblems for the data graphed in Figure 35 on page 79.
1 2 3 4 5 6
GA-S 004 044 203 1208 76983 61E6
GA-RC 004 044 706 124 7444 85E5
Table 62. The standard errors for GA-S and GA-RC on the busy beaver problem
with 4 states for the data graphed in Figure 36 on page 80.
3 4 5 6 7 8 9 10 11 12
GA-S 068 531 195 500 257 2201 43098 33E5 60E6 11E7
GA-RC 087 687 390 285 4407 50396 71E5 71E6
Table 63. The standard errors for GA-S and GA-RC on Holland's royal road problem
with k = 4 for the data graphed in Figure 37 on page 80.
1 2 3
GA-S 015 375 2070
GA-RC 016 657
Table 64. The standard errors for CH-1S, GA-E, BH-MM and BH-DMM on the one
max problem with 120 bits for the data graphed in Figure 38 on page 82.
66 72 78 84 90 96 102 108 114 120
CH-1S 031 080 181 352 619 104 194 345 691 627
GA-E 008 077 880 126 146 180 230 326 824
BH-MM 011 022 046 092 183 343 603 106 217 111
BH-DMM 012 021 037 066 115 205 359 629 133 695
Table 65. The standard errors for CH-1S, GA-E, BH-MM and BH-DMM on the fully
easy problem with 15 subproblems for the data graphed in Figure 39 on page 82.
6 7 8 9 10 11 12 13 14 15
CH-1S 284 367 480 659 975 164 334 879 3385 35004
GA-E 121 303 1028 5838 53218 10E6
BH-MM 288 371 477 616 812 110 159 278 666 2630
BH-DMM 19320 17E5
Table 66. The standard errors for CH-1S, GA-E, BH-MM and BH-DMM on the
fully deceptive problem with 15 subproblems for the data graphed in Figure 40 on
page 83.
6 7 8 9 10 11 12 13 14 15
CH-1S 863 161 312 708 2090 7764 41828 38E5 53E6
GA-E 815 199 488 1438 6489 47247 66E5
BH-MM 988 163 257 399 727 1445 3909 13855 78320 82E5
BH-DMM
Table 67. The standard errors for CH-1S, GA-E, BH-MM and BH-DMM on the
distributed fully deceptive problem with 15 subproblems for the data graphed in
Figure 41 on page 83.
1 2 3 4 5 6 7
CH-1S 011 147 605 757 7573 97328 18E6
GA-E 004 044 181 665 10004 13E5 37E6
BH-MM 042 881 383 2028 14548 19E5
BH-DMM 177 345 1106 5213 28631
Table 68. The standard errors for CH-1S, GA-E, BH-MM and BH-DMM on the
busy beaver problem with 4 states for the data graphed in Figure 42 on page 83.
4 5 6 7 8 9 10 11 12 13
CH-1S 192 536 117 364 136 964 15697 12E5 56E5 17E6
GA-E 659 224 503 167 951 16545 21E5 24E6 81E6
BH-MM 089 223 507 174 784 528 6948 10E5 63E5 20E6
BH-DMM 177 436 100 367 150 890 12761 19E5 92E5
Table 69. The standard errors for CH-1S, GA-E, BH-MM and BH-DMM on Hol-
land's royal road problem with k = 4 for the data graphed in Figure 43 on page 83.
1 2 3 4 5
CH-1S 193 211 258 14619
GA-E 015 343 1321 31E6
BH-MM 066 119 934 911 29560
BH-DMM 642 452
APPENDIX D
The set of (fitness, distance) pairs S_k is obtained when k copies of the base set are concatenated. Henceforth, we will abandon mention of fitness and distance, since the concepts are irrelevant to the proof. Clearly, S_1 = (X^{(1)}, Y^{(1)}) = S = (X, Y). X and Y will generally be used in preference to X^{(1)} and Y^{(1)}. The notation X_i^{(k)} will be used to represent a general element of X^{(k)}. A simple counting argument shows that |X^{(k)}| = |Y^{(k)}| = n^k. The correlation r(P) of any set of ordered pairs P = (A, B) is given by

    r(P) = \frac{C_{AB}}{\sigma_A \sigma_B}

where

    C_{AB} = \frac{1}{n} \sum_i (A_i - \bar{A})(B_i - \bar{B})

is the covariance of A and B, \bar{A} and \bar{B} are the means of A and B, and

    \sigma_A = \sqrt{\frac{\sum_{i=1}^n (A_i - \bar{A})^2}{n}} \quad \text{and} \quad \sigma_B = \sqrt{\frac{\sum_{i=1}^n (B_i - \bar{B})^2}{n}}

are the standard deviations of A and B.¹

¹ The denominator in the formula for standard deviation is usually given as n − 1, not n. This is because the value of the mean that is used in the formula is an estimate of the true mean. When the actual mean of the underlying distribution is known (as in the computation of FDC by examination of the entire representation space), n replaces n − 1 in the computation of standard deviation [192, page 19].
    n^k C_{X^{(k)} Y^{(k)}}
      = \sum_i X_i^{(k)} Y_i^{(k)} - \frac{k\bar{X} n^k \cdot k\bar{Y} n^k}{n^k}
      = \sum_i X_i^{(k)} Y_i^{(k)} - n^k k^2 \bar{X}\bar{Y}
      = \sum_i \left( X_i^{(k)} Y_i^{(k)} - k^2 \bar{X}\bar{Y} \right)
      = \sum_{i,j} \left[ (X_i^{(k-1)} + X_j)(Y_i^{(k-1)} + Y_j) - k^2 \bar{X}\bar{Y} \right]
      = \sum_{i,j} \left[ X_i^{(k-1)} Y_i^{(k-1)} + Y_j X_i^{(k-1)} + X_j Y_i^{(k-1)} + X_j Y_j - k^2 \bar{X}\bar{Y} \right]
      = \sum_{i,j} \left[ X_i^{(k-1)} Y_i^{(k-1)} - (k-1)^2 \bar{X}\bar{Y} \right]
        + \sum_{i,j} Y_j X_i^{(k-1)} + \sum_{i,j} X_j Y_i^{(k-1)} + \sum_{i,j} X_j Y_j - \sum_{i,j} (2k-1)\bar{X}\bar{Y}
      = \sum_{i,j} \left( X_i^{(k-1)} Y_i^{(k-1)} - \bar{X}^{(k-1)}\bar{Y}^{(k-1)} \right)
        + n\bar{Y} \cdot n^{k-1}(k-1)\bar{X} + n\bar{X} \cdot n^{k-1}(k-1)\bar{Y}
        + n^{k-1} \sum_i X_i Y_i - n^k (2k-1)\bar{X}\bar{Y}
      = \sum_{i,j} \left( X_i^{(k-1)} Y_i^{(k-1)} - \bar{X}^{(k-1)}\bar{Y}^{(k-1)} \right) + 2 n^k (k-1)\bar{X}\bar{Y}
        + n^{k-1} \sum_i X_i Y_i - n^k (2k-1)\bar{X}\bar{Y}
      = \sum_{i,j} \left( X_i^{(k-1)} Y_i^{(k-1)} - \bar{X}^{(k-1)}\bar{Y}^{(k-1)} \right) + n^k \bar{X}\bar{Y}\left( 2(k-1) - (2k-1) \right)
        + n^{k-1} \sum_i X_i Y_i
      = \sum_{i,j} \left( X_i^{(k-1)} Y_i^{(k-1)} - \bar{X}^{(k-1)}\bar{Y}^{(k-1)} \right) - n^k \bar{X}\bar{Y} + n^{k-1} \sum_i X_i Y_i
      = n \sum_i \left( X_i^{(k-1)} Y_i^{(k-1)} - \bar{X}^{(k-1)}\bar{Y}^{(k-1)} \right)
        + n^{k-1} \left( \sum_i X_i Y_i - n\bar{X}\bar{Y} \right)
      = n \sum_i \left( X_i^{(k-1)} Y_i^{(k-1)} - \bar{X}^{(k-1)}\bar{Y}^{(k-1)} \right)
        + n^{k-1} \left( \sum_i X_i Y_i - \frac{\sum_i X_i \sum_i Y_i}{n} \right)    (3)

We now examine the sum in the first term of (3), \sum_i (X_i^{(k-1)} Y_i^{(k-1)} - \bar{X}^{(k-1)}\bar{Y}^{(k-1)}).
If we let w = k − 1, then

    \sum_i \left( X_i^{(w)} Y_i^{(w)} - \bar{X}^{(w)}\bar{Y}^{(w)} \right)
      = \sum_i X_i^{(w)} Y_i^{(w)} - \sum_i \bar{X}^{(w)}\bar{Y}^{(w)}
      = \sum_i X_i^{(w)} Y_i^{(w)} - w^2 n^w \bar{X}\bar{Y}
      = \sum_i X_i^{(w)} Y_i^{(w)} - w^2 n^w \bar{X}\bar{Y} - w^2 n^w \bar{X}\bar{Y} + w^2 n^w \bar{X}\bar{Y}
      = \sum_i X_i^{(w)} Y_i^{(w)} - w\bar{Y} \cdot w n^w \bar{X} - w\bar{X} \cdot w n^w \bar{Y} + n^w\, w\bar{X}\, w\bar{Y}
      = \sum_i X_i^{(w)} Y_i^{(w)} - w\bar{Y} \sum_i X_i^{(w)} - w\bar{X} \sum_i Y_i^{(w)} + \sum_i \bar{X}^{(w)}\bar{Y}^{(w)}
      = \sum_i \left( X_i^{(w)} Y_i^{(w)} - w\bar{Y} X_i^{(w)} - w\bar{X} Y_i^{(w)} + \bar{X}^{(w)}\bar{Y}^{(w)} \right)
      = \sum_i \left( X_i^{(w)} Y_i^{(w)} - \bar{Y}^{(w)} X_i^{(w)} - \bar{X}^{(w)} Y_i^{(w)} + \bar{X}^{(w)}\bar{Y}^{(w)} \right)
      = \sum_i \left( X_i^{(w)} - \bar{X}^{(w)} \right)\left( Y_i^{(w)} - \bar{Y}^{(w)} \right)    (4)
Replacing w by k − 1 and substituting (4) back into (3), we have

    n^k C_{X^{(k)} Y^{(k)}} = n \sum_i \left( X_i^{(k-1)} - \bar{X}^{(k-1)} \right)\left( Y_i^{(k-1)} - \bar{Y}^{(k-1)} \right)
                              + n^{k-1} \sum_i (X_i - \bar{X})(Y_i - \bar{Y})

    C_{X^{(k)} Y^{(k)}} = \frac{\sum_i (X_i^{(k-1)} - \bar{X}^{(k-1)})(Y_i^{(k-1)} - \bar{Y}^{(k-1)})}{n^{k-1}}
                          + \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{n}
                        = C_{X^{(k-1)} Y^{(k-1)}} + C_{XY}
                        = k C_{XY}

This proves (1), the first part of the main theorem.
This will be proved by examining \sum_i (X_i^{(k)} - \bar{X}^{(k)})^2, from which \sigma_{X^{(k)}} (and similarly \sigma_{Y^{(k)}}) can be obtained.
    \sum_i \left( X_i^{(k)} - \bar{X}^{(k)} \right)^2
      = \sum_{i,j} \left[ (X_i^{(k-1)} - (k-1)\bar{X}) + (X_j - \bar{X}) \right]^2
      = \sum_{i,j} \left( X_i^{(k-1)} - (k-1)\bar{X} \right)^2
        + 2 \sum_{i,j} \left( X_i^{(k-1)} - (k-1)\bar{X} \right)(X_j - \bar{X})
        + \sum_{i,j} (X_j - \bar{X})^2    (5)

Consider the middle term of (5):

    2 \sum_{i,j} \left( X_i^{(k-1)} - (k-1)\bar{X} \right)(X_j - \bar{X})
      = 2 \left[ \sum_{i,j} X_i^{(k-1)} X_j - \sum_{i,j} X_i^{(k-1)} \bar{X}
                 - \sum_{i,j} (k-1)\bar{X} X_j + \sum_{i,j} (k-1)\bar{X}^2 \right]
      = 2 \left[ \sum_i X_i^{(k-1)} \sum_j X_j - n\bar{X} \sum_i X_i^{(k-1)}
                 - (k-1) n^{k-1} \bar{X} \sum_j X_j + (k-1) n^k \bar{X}^2 \right]
      = 2 \left[ (k-1) n^{k-1} \bar{X} \cdot n\bar{X} - (k-1) n^k \bar{X}^2
                 - (k-1) n^k \bar{X}^2 + (k-1) n^k \bar{X}^2 \right]
      = 0

Therefore (5) becomes

    \sum_i \left( X_i^{(k)} - \bar{X}^{(k)} \right)^2
      = \sum_{i,j} \left( X_i^{(k-1)} - (k-1)\bar{X} \right)^2 + \sum_{i,j} (X_j - \bar{X})^2
      = n \sum_i \left( X_i^{(k-1)} - (k-1)\bar{X} \right)^2 + n^{k-1} \sum_j (X_j - \bar{X})^2
      = n \sum_i \left( X_i^{(k-1)} - \bar{X}^{(k-1)} \right)^2 + n^{k-1} \sum_j (X_j - \bar{X})^2

Writing T_k for \sum_i (X_i^{(k)} - \bar{X}^{(k)})^2, this gives

    T_k = \begin{cases} n T_{k-1} + n^{k-1} T_1 & k > 1 \\ \sum_i (X_i - \bar{X})^2 & k = 1. \end{cases}

Small values of k suggest that T_k = k n^{k-1} \sum_i (X_i - \bar{X})^2, which is quickly confirmed by induction.
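The two recursions above imply that concatenation leaves the correlation unchanged: the covariance grows by a factor of k while each standard deviation grows by a factor of √k. A quick numerical check of this (Python; the construction of X^(k) and Y^(k) by summing over all k-tuples follows the definition above, everything else is an illustrative assumption) is sketched below.

import itertools
import numpy as np

# Small numerical check, not part of the original proof: build X^(k) and
# Y^(k) by summing base values over all k-tuples, then compare the
# correlation of the composite set with that of the base set.

def concatenate(X, Y, k):
    """Return X^(k), Y^(k): sums of the base values over all k-tuples."""
    Xk, Yk = [], []
    for idx in itertools.product(range(len(X)), repeat=k):
        Xk.append(sum(X[i] for i in idx))
        Yk.append(sum(Y[i] for i in idx))
    return np.array(Xk), np.array(Yk)

rng = np.random.default_rng(0)
X = rng.random(6)            # arbitrary base "fitness" values
Y = rng.random(6)            # arbitrary base "distance" values

r_base = np.corrcoef(X, Y)[0, 1]
for k in (2, 3):
    Xk, Yk = concatenate(X, Y, k)
    r_k = np.corrcoef(Xk, Yk)[0, 1]
    print(k, r_base, r_k)    # the two correlations agree for every k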
REFERENCES
"61] K. Mathias and L. D. Whitley. Genetic operators, the tness landscape and
the traveling salesman problem. In R. Manner and B. Manderick, editors,
Parallel Problem Solving From Nature, volume 2, pages 219{228, Amsterdam,
The Netherlands, 1992. Elsevier Science Publishers B.V.
"62] P. F. Stadler and W. Schnabl. The landscape of the traveling salesman problem.
Physics Letters A, 161:337{344, 1992.
"63] E. D. Weinberger. Fourier and taylor series on tness landscapes. Biological
Cybernetics, 65:321{330, 1990.
"64] E. D. Weinberger. Measuring correlations in energy landscapes and why it mat-
ters. In H. Atmanspacher and H. Scheingraber, editors, Information Dynamics,
pages 185{193. Plenum Press, New York, 1991.
"65] E. D. Weinberger. Correlated and uncorrelated tness landscapes and how to
tell the di
erence. Biological Cybernetics, 63:325{336, 1990.
"66] R. Parsons, S. Forrest, and C. Burks. Genetic operators for the DNA fragment
assembly problem. Machine Learning, 1995. (in press).
"67] N. J. Radcli
e and P. D. Surry. Fitness variance of formae and performance
prediction. In L. D. Whitley and M. D. Vose, editors, Foundations of Genetic
Algorithms, volume 3, San Mateo, CA, 1995. Morgan Kaufmann.
"68] D. E. Goldberg and R. Lingle. Alleles, loci, and the traveling salesman problem.
In J. J. Grefenstette, editor, Proceedings of an International Conference on
Genetic Algorithms and their Applications, pages 154{159. Lawrence Erlbaum,
Hillsdale, NJ, 24{26 July 1985.
"69] T. Starkweather, S. McDaniel, K. Mathias, and L. D. Whitley. A comparison
of genetic sequencing operators. In R. K. Belew and L. B. Booker, editors, Pro-
ceedings of the Fourth International Conference on Genetic Algorithms, pages
69{76, San Mateo, CA, 1991. Morgan Kaufmann.
"70] J. Dzubera and L. D. Whitley. Advanced correlation analysis of operators for
the traveling salesman problem. In Y. Davidor, H.-P. Schwefel, and R. Manner,
editors, Parallel Problem Solving From Nature { PPSN III, volume 866 of Lec-
ture Notes in Computer Science, pages 68{77, Berlin, 1994. Springer-Verlag.
"71] M. Gorges-Schleuter. ASPARAGOS An asynchronous parallel genetic optimiza-
tion strategy. In J. D. Scha
er, editor, Proceedings of the Third International
214
Conference on Genetic Algorithms, pages 422{427, San Mateo, CA, June 4{7
1989. Morgan Kaufmann.
"72] H. Muhlenbein. Evolution in time and space { The parallel genetic algorithm.
In G. J. E. Rawlins, editor, Foundations of Genetic Algorithms, volume 1, pages
316{337, San Mateo, CA, 1991. Morgan Kaufmann.
"73] P. Moscato and M. G. Norman. A \memetic" approach for the travelling sales-
man problem|implementation of a computational ecology for combinatorial
optimisation on message-passing systems. In Proceedings of the International
Conference on Parallel Computing and Transputer Applications, Amsterdam,
1992. IOS Press.
"74] Z. Michalewicz. Genetic Algorithms + Data Structures = Evolution Programs.
Springer-Verlag, New York, 2nd edition, 1994.
"75] K. A. De Jong. Genetic algorithms A 10 year perspective. In J. J. Grefenstette,
editor, Proceedings of an International Conference on Genetic Algorithms and
their Applications, pages 169{177, Hillsdale, NJ, 24{26 July 1985. Carnegie
Mellon University, Lawrence Erlbaum.
"76] S. Kirkpatrick. Optimization by simulated annealing: Quantitative studies.
Journal of Statistical Physics, 34:975{986, 1984.
"77] J. J. Hop eld and D. Tank. Neural computation of decisions in optimization
problems. Biological Cybernetics, 52:141{152, 1985.
"78] J. T. Richardson, M. R. Palmer, G. Liepins, and M. Hilliard. Some guidlines for
genetic algorithms with penalty functions. In J. D. Scha
er, editor, Proceedings
of the Third International Conference on Genetic Algorithms, pages 191{197,
San Mateo, CA, June 4{7 1989. Morgan Kaufmann.
"79] W. Siedlecki and J. Sklansky. Constrained genetic optimization via dynamic
reward-penalty balancing and its use in pattern recognition. In J. D. Schaf-
fer, editor, Proceedings of the Third International Conference on Genetic Algo-
rithms, pages 141{150, San Mateo, CA, June 4{7 1989. Morgan Kaufmann.
"80] L. Davis and M. Steenstrup. Genetic algorithms and simulated annealing: An
overview. In L. Davis, editor, Genetic Algorithms and Simulated Annealing,
pages 1{11. Pitman (Morgan Kaufmann), London, 1987.
"113] J. R. Levenick. Inserting introns improves genetic algorithm success rate: Tak-
ing a cue from biology. In R. K. Belew and L. B. Booker, editors, Proceedings
of the Fourth International Conference on Genetic Algorithms, pages 123{127,
San Mateo, CA, 1991. Morgan Kaufmann.
"114] R. G. Palmer and C. M. Pond. Internal eld distributions in model spin glasses.
Journal of Physics F, 9(7):1451{1459, 1979.
"115] K. A. De Jong, W. M. Spears, and D. F. Gordon. Using Markov chains to
analyze GAFOs. In L. D. Whitley and M. D. Vose, editors, Foundations of
Genetic Algorithms, volume 3, San Mateo, CA, 1995. Morgan Kaufmann. (To
appear).
"116] K. D. Boese, A. B. Kahng, and S. Muddu. A new adaptive multi-start technique
for combinatorial global optimizations. Operations Research Letters, 16(2):101{
113, September 1994.
"117] S. A. Kau
man. Adaptation on rugged tness landscapes. In D. Stein, editor,
Lectures in the Sciences of Complexity, volume 1, pages 527{618. Addison-
Wesley Longman, 1989.
"118] S. A. Kau
man. The Origins of Order Self-Organization and Selection in
Evolution. Oxford University Press, New York, 1993.
"119] W. A. Tackett. Recombination, Selection, and the Genetic Construction of
Computer Programs. PhD thesis, University of Southern California, Los Ange-
les, CA, April 1994.
"120] W. A. Tackett. Greedy recombination and genetic search on the space of com-
puter programs. In L. D. Whitley and M. D. Vose, editors, Foundations of
Genetic Algorithms, volume 3, San Mateo, CA, 1995. Morgan Kaufmann. (To
appear).
"121] N. J. Nilsson and D. Rumelhart. Approaches to Arti cial Intelligence. Tech-
nical Report 93{08{052, Santa Fe Institute, Santa Fe, NM, 1993. Summary of
workshop held November 6{9, 1992. Available via ftp from ftp.santafe.edu in
pub/Users/mm/approaches/approaches.ps.
"122] J.-L. Lauriere. A language and a program for stating and solving combinatorial
problems. Articial Intelligence, 10:29{127, 1978.
"123] S. Loyd. Mathematical Puzzles of Sam Loyd. Dover, new York, 1959.
"160] L. Davis. Bit climbing, representational bias and test suite design. In R. K.
Belew and L. B. Booker, editors, Proceedings of the Fourth International Con-
ference on Genetic Algorithms, pages 18{23, San Mateo, CA, 1991. Morgan
Kaufmann.
"161] S. Wright. Evolution in Mendelian populations. Genetics, 16:97{159, 1931.
"162] W. Fontana, P. F. Stadler, E. G. Bornberg-Bauer, T. Griesmacher, I. Hofacker,
M. Tacker, P. Tarazona, E. D. Weinberger, and P. Schuster. RNA folding and
combinatory landscapes. Physical Review E, 47(3):2083{2099, 1993.
"163] M. Huynen. Evolutionary Dynamics and Pattern Generation in the Sequence
and Secondary Structure of RNA: A Bioinformatic Approach. PhD thesis, Uni-
versity of Utrecht, Netherlands, September 1993.
"164] E. D. Weinberger. Local properties of Kau
man's N-k model: A tunably rugged
energy landscape. Physical Review A, 44(10):6399{6413, November 1991.
"165] K. E. Kinnear Jr. Fitness landscapes and diculty in genetic programming. In
Proceedings of the First IEEE Conference on Evolutionary Computing, pages
142{47, 1994.
"166] P. F. Stadler and W. Gruner. Anisotropy in tness landscapes. Journal of
Theoretical Biology, 165:373{388, 1993.
"167] P. Schuster and P. F. Stadler. Landscapes: Complex optimization problems
and biopolymer structures. Computers Chem., 18:295{314, 1994.
"168] P. F. Stadler. Linear operators on correlated landscapes. J. Physique, 4:681{
696, 1994.
"169] T. E.. Davis and J. C. Principe. A simulated annealing like convergence theory
for the simple genetic algorithm. In R. K. Belew and L. B. Booker, editors, Pro-
ceedings of the Fourth International Conference on Genetic Algorithms, pages
174{181, San Mateo, CA, 1991. Morgan Kaufmann.
"170] M. D. Vose. Modeling simple genetic algorithms. In L. D. Whitley, editor,
Foundations of Genetic Algorithms, volume 2, pages 63{73, San Mateo, CA,
1993. Morgan Kaufmann.
"171] J. Suzuki. A Markov chain analysis on a genetic algorithm. In S. Forrest, editor,
Genetic Algorithms: Proceedings of the Fifth International Conference (ICGA
1993), pages 146{153, San Mateo, CA, 1993. Morgan Kaufmann.