Genetic Algorithms: Theory and Applications
Lecture Notes Second Edition — WS 2001/2002 by Ulrich Bodenhofer
Institut für Algebra, Stochastik und wissensbasierte mathematische Systeme, Johannes Kepler Universität, A-4040 Linz, Austria
FLLL (Fuzzy Logic Laboratorium Linz-Hagenberg)
Preface
This is a printed collection of the contents of the lecture “Genetic Algorithms: Theory and Applications” which I gave first in the winter semester 1999/2000 at the Johannes Kepler University in Linz. The reader should be aware that this manuscript is subject to further reconsideration and improvement. Corrections, complaints, and suggestions are cordially welcome.

The sources were manifold: Chapters 1 and 2 were written originally for these lecture notes. All examples were implemented from scratch. The third chapter is a distillation of the books of Goldberg [13] and Hoffmann [15] and a handwritten manuscript of the preceding lecture on genetic algorithms which was given by Andreas Stöckl in 1993 at the Johannes Kepler University. Chapters 4, 5, and 7 contain recent adaptations of previously published material from my own master thesis and a series of lectures which was given by Francisco Herrera and myself at the Second Summer School on Advanced Control at the Slovak Technical University, Bratislava, in summer 1997 [4]. Chapter 6 was written originally, however strongly influenced by A. Geyer-Schulz’s works and H. Hörner’s paper on his C++ GP kernel [18].

I would like to thank all the students attending the first GA lecture in Winter 1999/2000 for remaining loyal throughout the whole term and for contributing much to these lecture notes with their vivid, interesting, and stimulating questions, objections, and discussions. Last but not least, I want to express my sincere gratitude to Sabine Lumpi and Susanne Saminger for support in organizational matters, and Peter Bauer for proofreading.
Ulrich Bodenhofer, February 2000.
Contents
1 Basic Ideas and Concepts
  1.1 Introduction
  1.2 Definitions and Terminology

2 A Simple Class of GAs
  2.1 Genetic Operations on Binary Strings
    2.1.1 Selection
    2.1.2 Crossover
    2.1.3 Mutation
    2.1.4 Summary
  2.2 Examples
    2.2.1 A Very Simple One
    2.2.2 An Oscillating One-Dimensional Function
    2.2.3 A Two-Dimensional Function
    2.2.4 Global Smoothness versus Local Perturbations
    2.2.5 Discussion

3 Analysis
  3.1 The Schema Theorem
    3.1.1 The Optimal Allocation of Trials
    3.1.2 Implicit Parallelism
  3.2 Building Blocks and the Coding Problem
    3.2.1 Example: The Traveling Salesman Problem
  3.3 Concluding Remarks

4 Variants
  4.1 Messy Genetic Algorithms
  4.2 Alternative Selection Schemes
  4.3 Adaptive Genetic Algorithms
  4.4 Hybrid Genetic Algorithms
  4.5 Self-Organizing Genetic Algorithms
5 Tuning of Fuzzy Systems Using Genetic Algorithms
  5.1 Tuning of Fuzzy Sets
    5.1.1 Coding Fuzzy Subsets of an Interval
    5.1.2 Coding Whole Fuzzy Partitions
    5.1.3 Standard Fitness Functions
    5.1.4 Genetic Operators
  5.2 A Practical Example
    5.2.1 The Fuzzy System
    5.2.2 The Optimization of the Classification System
    5.2.3 Concluding Remarks
  5.3 Finding Rule Bases with GAs

6 Genetic Programming
  6.1 Data Representation
    6.1.1 The Choice of the Programming Language
  6.2 Manipulating Programs
    6.2.1 Random Initialization
    6.2.2 Crossing Programs
    6.2.3 Mutating Programs
    6.2.4 The Fitness Function
  6.3 Fuzzy Genetic Programming
  6.4 A Checklist for Applying Genetic Programming

7 Classifier Systems
  7.1 Introduction
  7.2 Holland Classifier Systems
    7.2.1 The Production System
    7.2.2 The Bucket Brigade Algorithm
    7.2.3 Rule Generation
  7.3 Fuzzy Classifier Systems of the Michigan Type
    7.3.1 Directly Fuzzifying Holland Classifier Systems
    7.3.2 Bonarini’s ELF Method
    7.3.3 Online Modification of the Whole Knowledge Base

Bibliography
List of Figures
2.1 A graphical representation of roulette wheel selection
2.2 One-point crossover of binary strings
2.3 The function f2
2.4 A surface plot of the function f3
2.5 The function f4 and its derivative
3.1 Hypercubes of dimensions 1–4
3.2 A hyperplane interpretation of schemata for n = 3
3.3 Minimal deceptive problems
4.1 A messy coding
4.2 Positional preference
4.3 The cut and splice operation
5.1 Piecewise linear membership function with fixed grid points
5.2 Simple fuzzy sets with piecewise linear membership functions
5.3 Simple fuzzy sets with smooth membership functions
5.4 A fuzzy partition with N = 4 trapezoidal parts
5.5 Example for one-point crossover of fuzzy partitions
5.6 Mutating a fuzzy partition
5.7 Magnifications of typical representatives of the four types of pixels
5.8 Clockwise enumeration of neighbor pixels
5.9 Typical gray value curves corresponding to the four types
5.10 The linguistic variables v and e
5.11 Cross sections of a function of type (5.2)
5.12 A comparison of results obtained by several different optimization methods
5.13 A graphical representation of the results
6.1 The tree representation of (+ (* 3 X) (SIN (+ X 1)))
6.2 The derivation tree of (NOT (x OR y))
6.3 An example for crossing two binary logical expressions
6.4 An example for mutating a derivation tree
7.1 Basic architecture of a classifier system of the Michigan type
7.2 The bucket brigade principle
7.3 An example for repeated propagation of payoffs
7.4 A graphical representation of the table shown in Figure 7.3
7.5 Matching a fuzzy condition
Chapter 1 Basic Ideas and Concepts
Growing specialization and diversification have brought a host of monographs and textbooks on increasingly specialized topics. However, the “tree” of knowledge of mathematics and related fields does not grow only by putting forth new branches. It also happens, quite often in fact, that branches which were thought to be completely disparate are suddenly seen to be related.

Michiel Hazewinkel
1.1 Introduction
Applying mathematics to a problem of the real world mostly means, at first, modeling the problem mathematically, maybe with hard restrictions, idealizations, or simplifications, then solving the mathematical problem, and finally drawing conclusions about the real problem based on the solutions of the mathematical problem. For about 60 years, however, a shift of paradigms has taken place—in some sense, the opposite way has come into fashion. The point is that the world has done well even in times when nothing about mathematical modeling was known. More specifically, there is an enormous number of highly sophisticated processes and mechanisms in our world which have always attracted the interest of researchers due to their admirable perfection. To imitate such principles mathematically and to use them for solving a broader class of problems has turned out to be extremely helpful in various disciplines. Just briefly, let us mention the following three examples:
Artificial Neural Networks (ANNs): Simple models of nerve cells (neurons) and the way they interact; can be used for function approximation, machine learning, pattern recognition, etc. (e.g. [26, 32]).

Fuzzy Control: Humans are often able to control processes for which no analytic model is available. Such knowledge can be modeled mathematically by means of linguistic control rules and fuzzy sets (e.g. [21, 31]).

Simulated Annealing: A robust probabilistic optimization method mimicking the solidification of a crystal under slowly decreasing temperature; applicable to a wide class of problems (e.g. [23, 30]).

The fourth class of such methods will be the main object of study throughout this whole series of lectures—Genetic Algorithms (GAs).

The world as we see it today, with its variety of different creatures, its individuals highly adapted to their environment, with its ecological balance (under the optimistic assumption that there is still one), is the product of a three-billion-year experiment we call evolution, a process based on sexual and asexual reproduction, natural selection, mutation, and so on [9]. If we look inside, the complexity and adaptability of today’s creatures has been achieved by refining and combining the genetic material over a long period of time.

Generally speaking, genetic algorithms are simulations of evolution, of whatever kind. In most cases, however, genetic algorithms are nothing else than probabilistic optimization methods which are based on the principles of evolution. This idea appeared first in 1967 in J. D. Bagley’s thesis “The Behavior of Adaptive Systems Which Employ Genetic and Correlative Algorithms” [1]. The theory and applicability was then strongly influenced by J. H. Holland, who can be considered the pioneer of genetic algorithms [16, 17]. Since then, this field has witnessed a tremendous development.
The purpose of this lecture is to give a comprehensive overview of this class of methods and their applications in optimization, program induction, and machine learning.
1.2 Definitions and Terminology
As a first approach, let us restrict ourselves to the view that genetic algorithms are optimization methods. In general, optimization problems are given in the following form:

    Find an x0 ∈ X such that f is maximal in x0, where f : X → R is an arbitrary real-valued function, i.e.

        f(x0) = max_{x∈X} f(x).    (1.1)

In practice, it is sometimes almost impossible to obtain global solutions in the strict sense of (1.1). Depending on the actual problem, it can be sufficient to have a local maximum or to be at least close to a local or global maximum. So, let us assume in the following that we are interested in values x where the objective function f is “as high as possible”.

The search space X can be seen in direct analogy to the set of competing individuals in the real world, where f is the function which assigns a value of “fitness” to each individual (this is, of course, a serious simplification). In the real world, reproduction and adaptation is carried out on the level of genetic information. Consequently, GAs do not operate on the values in the search space X, but on some coded versions of them (strings for simplicity).

1.1 Definition. Assume S to be a set of strings (in non-trivial cases with some underlying grammar). Let X be the search space of an optimization problem as above; then a function

    c : X −→ S, x −→ c(x)

is called a coding function. Conversely, a function

    c̃ : S −→ X, s −→ c̃(s)

is called a decoding function.

In practice, coding and decoding functions, which have to be specified depending on the needs of the actual problem, are not necessarily bijective. However, it is in most cases useful to work with injective decoding functions (we will see examples soon). Moreover, the following equality is often supposed to be satisfied:

    (c ◦ c̃) ≡ id_S    (1.2)
Finally, we can write down the general formulation of the encoded maximization problem:

    Find an s0 ∈ S such that f̃ = f ◦ c̃ is as large as possible.

The following table gives a list of different expressions which are common in genetics, along with their equivalents in the framework of GAs:
Natural evolution    Genetic algorithm
genotype             coded string
phenotype            uncoded point
chromosome           string
gene                 string position
allele               value at a certain position
fitness              objective function value
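As a small illustration of Definition 1.1, the following Python sketch (our own, not part of the notes; the function names are illustrative) implements a coding/decoding pair for X = {0, . . . , 31} and S = {0, 1}^5, the setting later used in the example of Section 2.2.1:

```python
def c(x: int) -> str:
    """Coding function c : X -> S; maps an integer from {0, ..., 31}
    to its 5-bit binary representation."""
    return format(x, "05b")

def c_tilde(s: str) -> int:
    """Decoding function c~ : S -> X; maps a 5-bit string back to
    the integer it represents."""
    return int(s, 2)
```

Here the decoding function is injective and the condition (1.2) holds: encoding a decoded string, c(c_tilde(s)), returns s unchanged for every s in S.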
After this preparatory work, we can write down the basic structure of a genetic algorithm.

1.2 Algorithm.

t := 0;
Compute initial population B0;
WHILE stopping condition not fulfilled DO
BEGIN
    select individuals for reproduction;
    create offspring by crossing individuals;
    possibly mutate some individuals;
    compute new generation
END

As is obvious from the above algorithm, the transition from one generation to the next consists of four basic components:

Selection: Mechanism for selecting individuals (strings) for reproduction according to their fitness (objective function value).

Crossover: Method of merging the genetic information of two individuals; if the coding is chosen properly, two good parents produce good children.

Mutation: In real evolution, the genetic material can be changed randomly by erroneous reproduction or other deformations of genes, e.g. by gamma radiation. In genetic algorithms, mutation can be realized as a random deformation of the strings with a certain probability. The positive effect is the preservation of genetic diversity and, as a consequence, that local maxima can be avoided.

Sampling: Procedure which computes a new generation from the previous one and its offspring.
Compared with traditional continuous optimization methods, such as Newton or gradient descent methods, we can state the following significant differences:

1. GAs manipulate coded versions of the problem parameters instead of the parameters themselves, i.e. the search space is S instead of X itself.

2. While almost all conventional methods search from a single point, GAs always operate on a whole population of points (strings). This contributes much to the robustness of genetic algorithms: it improves the chance of reaching the global optimum and, vice versa, reduces the risk of becoming trapped in a local stationary point.

3. Normal genetic algorithms do not use any auxiliary information about the objective function, such as derivatives. Therefore, they can be applied to any kind of continuous or discrete optimization problem. The only thing to be done is to specify a meaningful decoding function.

4. GAs use probabilistic transition operators, while conventional methods for continuous optimization apply deterministic transition operators. More specifically, the way a new generation is computed from the actual one has some random components (we will see later, with the help of some examples, what these random components are like).
Chapter 2 A Simple Class of GAs
Once upon a time a fire broke out in a hotel, where just then a scientific conference was held. It was night and all guests were sound asleep. As it happened, the conference was attended by researchers from a variety of disciplines. The first to be awakened by the smoke was a mathematician. His first reaction was to run immediately to the bathroom, where, seeing that there was still water running from the tap, he exclaimed: “There is a solution!”. At the same time, however, the physicist went to see the fire, took a good look, and went back to his room to get an amount of water which would be just sufficient to extinguish the fire. The electronic engineer was not so choosy and started to throw buckets and buckets of water on the fire. Finally, when the biologist awoke, he said to himself: “The fittest will survive” and went back to sleep.

Anecdote originally told by C. L. Liu

In this chapter, we will present a very simple but extremely important subclass—genetic algorithms working with a fixed number of binary strings of fixed length. For this purpose, let us assume that the strings we consider are all from the set S = {0, 1}^n, where n is obviously the length of the strings. The population size will be denoted by m in the following. Therefore, the generation at time t is a list of m strings which we will denote by Bt = (b1,t, b2,t, . . . , bm,t). All GAs in this chapter will obey the following structure:
2.1 Algorithm.

t := 0;
Compute initial population B0 = (b1,0, . . . , bm,0);
WHILE stopping condition not fulfilled DO
BEGIN
    FOR i := 1 TO m DO
        select an individual bi,t+1 from Bt;
    FOR i := 1 TO m − 1 STEP 2 DO
        IF Random[0, 1] ≤ pC THEN
            cross bi,t+1 with bi+1,t+1;
    FOR i := 1 TO m DO
        possibly mutate bi,t+1;
    t := t + 1
END

Obviously, selection, crossover (done only with a probability of pC here), and mutation are still degrees of freedom, while the sampling operation is already specified. As is easy to see, every selected individual is replaced by one of its children after crossover and mutation; unselected individuals die immediately. This is a rather common sampling operation, although other variants are known and reasonable. In the following, we will study the three remaining operations: selection, crossover, and mutation.
2.1 Genetic Operations on Binary Strings

2.1.1 Selection
Selection is the component which guides the algorithm to the solution by preferring individuals with high fitness over low-fitted ones. It can be a deterministic operation, but in most implementations it has random components.

One variant, which is very popular nowadays (we will give a theoretical explanation of its good properties later), is the following scheme, where the probability to choose a certain individual is proportional to its fitness. It can be regarded as a random experiment with

    P[bj,t is selected] = f(bj,t) / Σ_{k=1}^{m} f(bk,t).    (2.1)

Of course, this formula only makes sense if all the fitness values are positive. If this is not the case, a non-decreasing transformation ϕ : R → R⁺ must be applied (a shift in the simplest case). Then the probabilities can be expressed as

    P[bj,t is selected] = ϕ(f(bj,t)) / Σ_{k=1}^{m} ϕ(f(bk,t)).    (2.2)

We can force the property (2.1) to be satisfied by applying a random experiment which is, in some sense, a generalized roulette game. In this roulette game, the slots are not equally wide, i.e. the different outcomes can occur with different probabilities. Figure 2.1 gives a graphical hint how this roulette wheel game works.

The algorithmic formulation of the selection scheme (2.1) can be written down as follows, analogously for the case of (2.2):

2.2 Algorithm.

x := Random[0, 1];
i := 1;
WHILE i < m & x < Σ_{j=1}^{i} f(bj,t) / Σ_{j=1}^{m} f(bj,t) DO
    i := i + 1;
select bi,t;

For obvious reasons, this method is often called proportional selection.
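For illustration, Algorithm 2.2 can be sketched in Python as follows; the function name and interface are our own choices, not prescribed by the notes, and the fitness values are assumed to be positive as in (2.1):

```python
import random

def proportional_selection(population, f):
    """Roulette wheel selection in the sense of Algorithm 2.2: the
    probability of picking an individual is proportional to its
    (positive) fitness value f(b)."""
    fitnesses = [f(b) for b in population]
    total = sum(fitnesses)
    x = random.random() * total      # spin the wheel once
    cumulative = 0.0
    for b, fb in zip(population, fitnesses):
        cumulative += fb
        if x < cumulative:
            return b
    return population[-1]            # guard against rounding errors
```

Calling this m times yields the selected intermediate population; individuals with high fitness tend to appear several times, low-fitted ones tend to die out.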
2.1.2 Crossover
In sexual reproduction, as it appears in the real world, the genetic material of the two parents is mixed when the gametes of the parents merge. Usually, chromosomes are randomly split and merged, with the consequence that some genes of a child come from one parent while others come from the other parent.
Figure 2.1: A graphical representation of roulette wheel selection, where the number of alternatives m is 6. The numbers inside the arcs correspond to the probabilities to which the alternative is selected.
This mechanism is called crossover. It is a very powerful tool for introducing new genetic material and maintaining genetic diversity, but with the outstanding property that good parents also produce wellperforming children or even better ones. Several investigations have come to the conclusion that crossover is the reason why sexually reproducing species have adapted faster than asexually reproducing ones.
Basically, crossover is the exchange of genes between the chromosomes of the two parents. In the simplest case, we can realize this process by cutting two strings at a randomly chosen position and swapping the two tails. This process, which we will call onepoint crossover in the following, is visualized in Figure 2.2.
Figure 2.2: One-point crossover of binary strings.

2.3 Algorithm.

pos := Random{1, . . . , n − 1};
FOR i := 1 TO pos DO
BEGIN
    Child1[i] := Parent1[i];
    Child2[i] := Parent2[i]
END
FOR i := pos + 1 TO n DO
BEGIN
    Child1[i] := Parent2[i];
    Child2[i] := Parent1[i]
END

One-point crossover is a simple and often-used method for GAs which operate on binary strings. For other problems or different codings, other crossover methods can be useful or even necessary. We mention just a small collection of them; for more details, see [11, 13]:

N-point crossover: Instead of only one, N breaking points are chosen randomly. Every second section is swapped. Among this class, two-point crossover is particularly important.

Segmented crossover: Similar to N-point crossover, with the difference that the number of breaking points can vary.
Uniform crossover: For each position, it is decided randomly if the positions are swapped.

Shuffle crossover: First a randomly chosen permutation is applied to the two parents, then N-point crossover is applied to the shuffled parents; finally, the shuffled children are transformed back with the inverse permutation.
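One-point crossover (Algorithm 2.3) can be sketched in Python as follows; the string-based representation and the function name are our own choices:

```python
import random

def one_point_crossover(parent1: str, parent2: str):
    """One-point crossover: cut both strings at a random position
    pos in {1, ..., n-1} and swap the two tails."""
    n = len(parent1)
    pos = random.randint(1, n - 1)
    child1 = parent1[:pos] + parent2[pos:]
    child2 = parent2[:pos] + parent1[pos:]
    return child1, child2
```

For example, crossing 01101 and 11000 at position 4 yields 01100 and 11001, exactly as in the hand-computed example of Section 2.2.1.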
2.1.3 Mutation
The last ingredient of our simple genetic algorithm is mutation—the random deformation of the genetic information of an individual by means of radioactive radiation or other environmental influences. In real reproduction, the probability that a certain gene is mutated is almost equal for all genes. So, it is natural to use the following mutation technique for a given binary string s, where pM is the probability that a single gene is modified:

2.4 Algorithm.

FOR i := 1 TO n DO
    IF Random[0, 1] < pM THEN
        invert s[i];

Of course, pM should be rather low in order to avoid the GA behaving chaotically like a random search. Again, similar to the case of crossover, the choice of the appropriate mutation technique depends on the coding and the problem itself. We mention a few alternatives; more details can be found in [11] and [13] again:

Inversion of single bits: With probability pM, one randomly chosen bit is negated.

Bitwise inversion: The whole string is inverted bit by bit with probability pM.

Random selection: With probability pM, the string is replaced by a randomly chosen one.
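A bitwise mutation operator in the sense of Algorithm 2.4 might look as follows in Python (again a sketch with names of our choosing):

```python
import random

def mutate(s: str, p_m: float) -> str:
    """Invert each bit of the binary string s independently with
    probability p_m (Algorithm 2.4)."""
    return "".join(
        ("1" if bit == "0" else "0") if random.random() < p_m else bit
        for bit in s
    )
```

With p_m = 0 the string is returned unchanged; with p_m = 1 every bit is inverted; a realistic run uses a small value such as 0.001.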
2.1.4 Summary
If we ﬁll in the methods described above, we can write down a universal genetic algorithm for solving optimization problems in the space S = {0, 1}n .
2.5 Algorithm.

t := 0;
Create initial population B0 = (b1,0, . . . , bm,0);
WHILE stopping condition not fulfilled DO
BEGIN
    (∗ proportional selection ∗)
    FOR i := 1 TO m DO
    BEGIN
        x := Random[0, 1];
        k := 1;
        WHILE k < m & x < Σ_{j=1}^{k} f(bj,t) / Σ_{j=1}^{m} f(bj,t) DO
            k := k + 1;
        bi,t+1 := bk,t
    END
    (∗ one-point crossover ∗)
    FOR i := 1 TO m − 1 STEP 2 DO
    BEGIN
        IF Random[0, 1] ≤ pC THEN
        BEGIN
            pos := Random{1, . . . , n − 1};
            FOR k := pos + 1 TO n DO
            BEGIN
                aux := bi,t+1[k];
                bi,t+1[k] := bi+1,t+1[k];
                bi+1,t+1[k] := aux
            END
        END
    END
    (∗ mutation ∗)
    FOR i := 1 TO m DO
        FOR k := 1 TO n DO
            IF Random[0, 1] < pM THEN
                invert bi,t+1[k];
    t := t + 1
END
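Algorithm 2.5 can be transcribed into a short, runnable Python program. The sketch below is one possible implementation under our own conventions (strings as bit lists, fitness evaluated on the decoded integer); parameter names and defaults are illustrative, not prescribed by the notes:

```python
import random

def run_ga(f, n=5, m=4, p_c=1.0, p_m=0.001, generations=20, seed=None):
    """A compact transcription of Algorithm 2.5 for S = {0,1}^n.
    `f` is evaluated on the decoded integer value of a string."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(m)]
    decode = lambda b: int("".join(map(str, b)), 2)
    for _ in range(generations):
        fit = [f(decode(b)) for b in pop]
        total = sum(fit)
        # proportional selection (roulette wheel)
        def select():
            x = rng.random() * total
            acc = 0.0
            for b, fb in zip(pop, fit):
                acc += fb
                if x < acc:
                    return b[:]
            return pop[-1][:]
        new = [select() for _ in range(m)]
        # one-point crossover of consecutive pairs
        for i in range(0, m - 1, 2):
            if rng.random() <= p_c:
                pos = rng.randint(1, n - 1)
                new[i][pos:], new[i + 1][pos:] = new[i + 1][pos:], new[i][pos:]
        # mutation: invert each bit with probability p_m
        for b in new:
            for k in range(n):
                if rng.random() < p_m:
                    b[k] = 1 - b[k]
        pop = new
    return max(pop, key=lambda b: f(decode(b)))
```

With f(x) = x² on {0, . . . , 31} (the example of Section 2.2.1), a call such as run_ga(lambda x: x * x, seed=1) returns one of the fittest strings of the final population.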
2.2 Examples

2.2.1 A Very Simple One
Consider the problem of finding the global maximum of the following function:

    f1 : {0, . . . , 31} −→ R, x −→ x²
Of course, the solution is obvious, but the simplicity of this problem allows us to compute some steps by hand in order to gain some insight into the principles behind genetic algorithms. The first step on the checklist of things which have to be done in order to make a GA work is, of course, to specify a proper string space along with an appropriate coding and decoding scheme. In this example, it is natural to consider S = {0, 1}^5, where a value from {0, . . . , 31} is coded by its binary representation. Correspondingly, a string is decoded as

    c̃(s) = Σ_{i=0}^{4} s[4 − i] · 2^i.
Like in [13], let us assume that we use Algorithm 2.5 as it is, with a population size of m = 4, a crossover probability pC = 1, and a mutation probability of pM = 0.001. If we compute the initial generation randomly with uniform distribution over {0, 1}^5, we obtain the following in the first step:

Individual No.   String (genotype)   x value (phenotype)   f(x) = x²   p_select = fi / Σ fj
1                01101               13                    169         0.14
2                11000               24                    576         0.49
3                01000               8                     64          0.06
4                10011               19                    361         0.31

One can easily compute that the sum of fitness values is 1170, the average is 293, and the maximum is 576. We see from the last column in which way proportional selection favors highly fit individuals (such as no. 2) over poorly fit ones (such as no. 3). A random experiment could, for instance, give the result that individuals no. 1 and no. 4 are selected for the new generation, while no. 3 dies and no. 2 is selected twice, and we obtain the second generation as follows (the crossover sites are marked with "|"):

Selected individuals   Crossover site (random)   New population   x value   f(x) = x²
0110|1                 4                         01100            12        144
1100|0                 4                         11001            25        625
11|000                 2                         11011            27        729
10|011                 2                         10000            16        256
Figure 2.3: The function f2.

So, we obtain a new generation with a sum of fitness values of 1754, an average of 439, and a maximum of 729. We can see from this very basic example in which way selection favors highly fit individuals and how crossover of two parents can produce an offspring which is even better than both of its parents. It is left to the reader as an exercise to continue this example.
2.2.2 An Oscillating One-Dimensional Function
Now we are interested in the global maximum of the function

    f2 : [−1, 1] −→ R, x −→ 1 + e^(−x²) · cos(36x).

As one can see easily from the plot in Figure 2.3, the function has a global maximum in 0 and a lot of local maxima. First of all, in order to work with binary strings, we have to discretize the search space [−1, 1]. A common technique for doing so is to make a uniform grid of 2^n points, then to enumerate the grid points, and to use the binary representation of the point index as coding. In the general form (for an arbitrary interval [a, b]), this looks as follows:

    c_{n,[a,b]} : [a, b] −→ {0, 1}^n, x −→ bin_n( round( (2^n − 1) · (x − a)/(b − a) ) ),    (2.3)
where bin_n is the function which converts a number from {0, . . . , 2^n − 1} to its binary representation of length n. This operation is not bijective since information is lost due to the rounding operation. Obviously, the corresponding decoding function can be defined as

    c̃_{n,[a,b]} : {0, 1}^n −→ [a, b], s −→ a + bin_n^{−1}(s) · (b − a)/(2^n − 1).    (2.4)
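The coding and decoding functions (2.3) and (2.4) can be sketched in Python as follows; the function names are ours, not from the text:

```python
def encode(x: float, a: float, b: float, n: int) -> str:
    """c_{n,[a,b]}: map x in [a,b] to the n-bit binary representation
    of the index of the nearest point on a uniform grid of 2^n points."""
    k = round((2 ** n - 1) * (x - a) / (b - a))
    return format(k, f"0{n}b")

def decode(s: str, a: float, b: float) -> float:
    """c~_{n,[a,b]}: map an n-bit string back to its grid point in [a,b]."""
    n = len(s)
    return a + int(s, 2) * (b - a) / (2 ** n - 1)
```

For instance, with n = 16 on [−1, 1], the string 0111111111111111 from the listing below decodes to about −1.53 · 10⁻⁵, in accordance with the computation at the end of Section 2.2.2; moreover, encode(decode(s)) = s, i.e. property (1.2) holds.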
It is left as an exercise to show that the decoding function c̃_{n,[a,b]} is injective and that the equality (1.2) holds for the pair (c_{n,[a,b]}, c̃_{n,[a,b]}).

Applying the above coding scheme to the interval [−1, 1] with n = 16, we get a maximum accuracy of the solution of

    (1/2) · 2/(2^16 − 1) ≈ 1.52 · 10^(−5).

Now let us apply Algorithm 2.5 with m = 6, pC = 1, and pM = 0.005. The first and the last generation are given as follows:
Generation 1, max. fitness 1.9836 at 0.0050
#0 0111111101010001  fitness: 1.98
#1 1101111100101011  fitness: 0.96
#2 0111111101011011  fitness: 1.98
#3 1001011000011110  fitness: 1.97
#4 1001101100101011  fitness: 1.20
#5 1100111110011110  fitness: 0.37
Average Fitness: 1.41
...
Generation 52, max. fitness 2.0000 at 0.0000
#0 0111111101111011  fitness: 1.99
#1 0111111101111011  fitness: 1.99
#2 0111111101111011  fitness: 1.99
#3 0111111111111111  fitness: 2.00
#4 0111111101111011  fitness: 1.99
#5 0111111101111011  fitness: 1.99
Average Fitness: 1.99
We see that the algorithm arrives at the global maximum after 52 generations, i.e. it needed no more than 52 × 6 = 312 evaluations of the fitness function, while the total size of the search space is 2^16 = 65536. We can draw the conclusion—at least for this example—that the GA is definitely better than a pure random search or an exhaustive method which stupidly scans the whole search space.
Figure 2.4: A surface plot of the function f3.

Just in order to get more insight into the coding/decoding scheme, let us take the best string 0111111111111111. Its representation as an integer number is 32767. Computing the decoding function yields

    −1 + 32767 · (1 − (−1))/65535 = −1 + 0.9999847 = −0.0000153.
2.2.3 A Two-Dimensional Function
As the next example, we study the function

    f3 : [−10, 10]² −→ R, (x, y) −→ (1 − sin²(√(x² + y²))) / (1 + 0.001 · (x² + y²)).
As one can see easily from the plot in Figure 2.4, the function has a global maximum in 0 and a lot of local maxima. Let us use the coding/decoding scheme as shown in (2.3) and (2.4) for the two components x and y independently with n = 24, i.e. c_{24,[−10,10]} and c̃_{24,[−10,10]} are used as coding and decoding functions, respectively. In order to get a coding for the two-dimensional vector, we can use concatenation and
splitting:

    c3 : [−10, 10]² −→ {0, 1}^48, (x, y) −→ c_{24,[−10,10]}(x) c_{24,[−10,10]}(y)

    c̃3 : {0, 1}^48 −→ [−10, 10]², s −→ ( c̃_{24,[−10,10]}(s[1 : 24]), c̃_{24,[−10,10]}(s[25 : 48]) )
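The concatenation/splitting scheme for c3 and c̃3 can be sketched in Python, reusing the interval coding from (2.3) and (2.4); all helper names are our own:

```python
def encode_interval(x: float, a: float, b: float, n: int) -> str:
    """c_{n,[a,b]} from (2.3): x in [a,b] -> n-bit string."""
    return format(round((2 ** n - 1) * (x - a) / (b - a)), f"0{n}b")

def decode_interval(s: str, a: float, b: float) -> float:
    """c~_{n,[a,b]} from (2.4): n-bit string -> grid point in [a,b]."""
    return a + int(s, 2) * (b - a) / (2 ** len(s) - 1)

def encode2d(x: float, y: float, n: int = 24,
             a: float = -10.0, b: float = 10.0) -> str:
    """c3: encode each coordinate with n bits and concatenate."""
    return encode_interval(x, a, b, n) + encode_interval(y, a, b, n)

def decode2d(s: str, n: int = 24,
             a: float = -10.0, b: float = 10.0):
    """c~3: split the 2n-bit string in two halves and decode them."""
    return decode_interval(s[:n], a, b), decode_interval(s[n:], a, b)
```

Encoding (0, 0) yields a 48-bit string whose decoding is a grid point very close to the global maximum; the grid spacing 20/(2²⁴ − 1) ≈ 1.2 · 10⁻⁶ bounds the error per coordinate.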
If we apply Algorithm 2.5 with m = 50, pC = 1, pM = 0.01, we observe that a fairly good solution is reached after 693 generations (at most 34650 evaluations at a search space size of 2.81 · 1014 ):
Generation 693, max. fitness 0.9999 at (0.0098, 0.0000)
#0  000000001000000001000000000000000000000010000000  fitness: 1.00
#1  000001000000011001000110000000000000000010100010  fitness: 0.00
#2  000000001000000000100000000000000000000010000000  fitness: 1.00
#3  000000001000001001000000000000000000000010000000  fitness: 0.97
#4  000000001000001011001000000000000000000010000011  fitness: 0.90
#5  000000101000000001000010000100000000000010000000  fitness: 0.00
#6  000000001000000011000000000000001000000010000011  fitness: 0.00
#7  000000001000000001100000000010000000000110000000  fitness: 0.00
#8  000000001001000001000000000000000000000000100010  fitness: 0.14
#9  000000001000000001000000000000000000000010100010  fitness: 0.78
#10 000000001000011011000000000000000000000010000000  fitness: 0.75
#11 000000001000000001000000000000000000000010100000  fitness: 0.64
#12 000000001000001000010010000000000000000010001001  fitness: 0.56
#13 000000001000001011000000000000000000000010100010  fitness: 0.78
#14 000000001000000001000001000000000000000010000000  fitness: 1.00
#15 000000001000000001100000100000000000000010000000  fitness: 0.00
#16 000000001000001010001000000000000000000010100010  fitness: 0.78
#17 000000001000011011000000000000000000000010000011  fitness: 0.70
#18 000000001000001011001000000000000000000010000011  fitness: 0.90
#19 000000001000011001000010001000010000000010000010  fitness: 0.00
#20 000000001000000001000000000001000000000010100010  fitness: 0.00
#21 000000001000011001100000000000000000010010000000  fitness: 0.00
#22 000000001000000101100000000000000000010010000000  fitness: 0.00
#23 000000001000100001000000000000000000000010000111  fitness: 0.44
#24 000000001000000011000000000000000000000000000000  fitness: 0.64
#25 000000001000000001011000000000010000000010100010  fitness: 0.00
#26 000000001000000001001000000000000000000000100010  fitness: 0.23
#27 000000001000001011000010000000000000000010100010  fitness: 0.78
#28 000000001000001011100010000000000000000010101010  fitness: 0.97
#29 010000001000000011000000000000000010010010000000  fitness: 0.00
#30 000000001000001011000000000000000000000010000011  fitness: 0.90
#31 000000001000011011000000000000000000000011000011  fitness: 0.26
#32 000000001000001001100000000000000000000010000000  fitness: 0.97
#33 000000001001001011000110000000000000000011110100  fitness: 0.87
#34 000000001000000000000000000000000000000010100010  fitness: 0.78
#35 000000001000001011001000000000000000000010000010  fitness: 0.93
#36 000000001000011011000000000000000010000010000001  fitness: 0.00
#37 000000001000001011000000000010000000000010100010  fitness: 0.00
#38 000000001000001011000010010000000000000010000000  fitness: 0.00
#39 000000001000000001000000000001000000000010100010  fitness: 0.00
#40 000000001000001001000110000000000000000011010100  fitness: 0.88
#41 000000001010000001000000000000000000000010000000  fitness: 0.66
#42 000000001000001001100110000000000000000011010100  fitness: 0.88
#43 000000000000000000000000000000000000000010000011  fitness: 0.64
#44 000000001000001011001000000000000000000010100000  fitness: 0.65
#45 000000001000001011000110000000000000000011110100  fitness: 0.81
#46 000000000000000000000000000000000000000010000000  fitness: 0.64
#47 000000001000010001000110000000000000000010000000  fitness: 0.89
#48 000000001000001011000000000000000000000010100011  fitness: 0.84
#49 000000001000000111000000000000000000000010000001  fitness: 0.98
Average Fitness: 0.53
Again, we learn from this example that the GA is certainly much faster than an exhaustive algorithm or a pure random search. Since f3 is perfectly smooth, however, the question arises which result we obtain if we apply a conventional method with a randomly selected initial value. In this example,
the expectation is obvious: the global maximum (0, 0) is surrounded by a ring of minima at a radius of π/2. If we apply, for instance, BFGS (Broyden-Fletcher-Goldfarb-Shanno, a very efficient quasi-Newton method for continuous unconstrained function optimization [7]) with line search, convergence to the global maximum is likely if the initial value lies inside that ring, but only in this case. If we take the initial value randomly from [−10, 10]² with uniform distribution, the probability to get a value from the appropriate neighborhood of the global maximum is

    (π/2)² · π / (10 · 10) = π³/400 ≈ 0.0775.

The expected number of trials until we get such an initial value is, therefore, 1/0.0775 ≈ 13. In a test implementation, it took 15 trials (random initial values) until the BFGS method with line search found the correct global optimum. The total time for all these computations was 5 milliseconds on an SGI O2 (MIPS R5000/180SC). The genetic algorithm, as above, took 1.5 seconds until it found the global optimum with comparable accuracy.

This example shows that GAs are not necessarily fast. Moreover, they are in many cases much slower than conventional methods which make use of derivatives. The next example, however, will drastically show us that there are even smooth functions which can be hard for conventional optimization techniques.
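The multistart procedure described above can be sketched as follows. Since f3 itself is not restated in this section, the code uses a hypothetical stand-in with the same qualitative shape (a global maximum of 1 at the origin, surrounded by a ring of minima); it is an illustration, not the lecture's exact experiment:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical stand-in for f3: global maximum 1 at (0, 0),
# surrounded by a ring of minima (NOT the lecture's exact f3).
def f(x):
    r = np.sqrt(x[0] ** 2 + x[1] ** 2)
    return np.cos(2.0 * r) * np.exp(-r * r / 20.0)

rng = np.random.default_rng(0)
best = None
for trials in range(1, 500):
    x0 = rng.uniform(-10.0, 10.0, size=2)       # random initial value
    res = minimize(lambda x: -f(x), x0, method="BFGS")  # maximize f
    if f(res.x) > 0.99:                         # only the origin reaches this
        best = res.x
        break
print(trials, best)
```

As in the text, most restarts converge to one of the suboptimal ring maxima; only an initial value inside the innermost ring leads to the global optimum.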
2.2.4  Global Smoothness versus Local Perturbations

Consider the function

    f4 : [−2, 2] −→ R
         x −→ e^(−x²) + 0.01 cos(200x).
As is easy to see from Figure 2.5, this function has a clear bell-like shape with small but highly oscillating perturbations. In the first derivative, these oscillations are drastically emphasized (see Figure 2.5):

    f4'(x) = −2x · e^(−x²) − 2 sin(200x)
We applied the simple GA as in Algorithm 2.5 with n = 16, i.e. the pair c16,[−2,2]/c̃16,[−2,2] as coding/decoding scheme, m = 10, pC = 1, and pM = 0.005. The result was that the global maximum at x = 0 was found after 9 generations (i.e. at most 90 evaluations of the fitness function) and
Figure 2.5: The function f4 (top) and its derivative (bottom).

5 milliseconds computation time, respectively (on the same computer as above). In order to repeat the above comparison, BFGS with line search and random selection of the initial value was applied to f4 as well. The global optimum was found after 30 trials (initial values) with perfect accuracy, but with 9 milliseconds of computation time. We see that, depending on the structure of the objective function, a GA can even outperform an acknowledged conventional method which makes use of derivatives.
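The experiment above can be reproduced with a minimal sketch of the simple GA (binary coding on [−2, 2], proportional selection, one-point crossover, bitwise mutation). The population size and generation count below are our own choices, larger than in the text, so that a single run reliably locates the maximum:

```python
import math
import random

random.seed(42)

N_BITS, A, B = 16, -2.0, 2.0            # 16-bit coding on [-2, 2]
M, GENERATIONS, P_C, P_M = 30, 100, 1.0, 0.005

def decode(bits):
    """Map a bit list to a real number in [A, B]."""
    x = int("".join(map(str, bits)), 2)
    return A + (B - A) * x / (2 ** N_BITS - 1)

def f4(x):
    return math.exp(-x * x) + 0.01 * math.cos(200.0 * x)

def mutate(bits):
    return [b ^ (random.random() < P_M) for b in bits]

pop = [[random.randint(0, 1) for _ in range(N_BITS)] for _ in range(M)]
best_x, best_f = None, -1.0

for _ in range(GENERATIONS):
    for ind in pop:                      # track the best individual seen
        fx = f4(decode(ind))
        if fx > best_f:
            best_f, best_x = fx, decode(ind)
    weights = [max(f4(decode(ind)), 1e-12) for ind in pop]
    mating = random.choices(pop, weights=weights, k=M)   # roulette wheel
    nxt = []
    for i in range(0, M, 2):
        p1, p2 = mating[i], mating[i + 1]
        if random.random() < P_C:        # one-point crossover
            cut = random.randint(1, N_BITS - 1)
            c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
        else:
            c1, c2 = p1[:], p2[:]
        nxt.extend([mutate(c1), mutate(c2)])
    pop = nxt

print(round(best_x, 3), round(best_f, 3))
```

The run converges to the global maximum near x = 0 despite the local perturbations that mislead derivative-based methods.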
2.2.5  Discussion

Finally, let us summarize some conclusions about the four examples above:
• Algorithm 2.5 is very universal. More or less the same algorithm has been applied to four fundamentally different optimization tasks.

• As seen in 2.2.4, GAs can even be faster in finding global maxima than conventional methods, in particular when derivatives provide misleading information. We should not forget, however, that, in most cases where conventional methods can be applied, GAs are much slower because they do not take auxiliary information like derivatives into account. In such optimization problems, there is no need to apply a GA, which gives less accurate solutions after much longer computation time.

• The enormous potential of GAs lies elsewhere: in the optimization of non-differentiable or even discontinuous functions, in discrete optimization, and in program induction.
Chapter 3

Analysis
Although the belief that an organ so perfect as the eye could have been formed by natural selection, is enough to stagger any one; yet in the case of any organ, if we know of a long series of gradations in complexity, each good for its possessor, then, under changing conditions of life, there is no logical impossibility in the acquirement of any conceivable degree of perfection through natural selection.

    Charles R. Darwin

In this remark, Darwin, in some sense, tries to turn around the burden of proof for his theory simply by saying that there is no evidence against it. This chapter is intended to give an answer to the question why genetic algorithms work, in a way which is philosophically more correct than Darwin's. However, we will see that, as in Darwin's theory of evolution, the complexity of the mechanisms makes mathematical analysis difficult and complicated.

For conventional deterministic optimization methods, such as gradient methods, Newton or quasi-Newton methods, etc., it is rather usual to have results which guarantee that the sequence of iterations converges to a local optimum with a certain speed or order. For any probabilistic optimization method, theorems of this kind cannot be formulated, because the behavior of the algorithm is not determinable in general. Statements about the convergence of probabilistic optimization methods can only give information about the expected or average behavior. In the case of genetic algorithms, there are a few circumstances which make it even more difficult to investigate their convergence behavior:
• Since a single transition from one generation to the next is a combination of usually three probabilistic operators (selection, crossover, and mutation), the inner structure of a genetic algorithm is rather complicated.

• For each of the involved probabilistic operators, many different variants have been proposed; thus it is not possible to give general convergence results, due to the fact that the choice of the operators influences the convergence fundamentally.
In the following, we will not be able to give "hard" convergence theorems, but only a summary of results giving a clue why genetic algorithms work for many problems but not necessarily for all problems. For simplicity, we will restrict ourselves to algorithms of type 2.1, i.e. GAs with a fixed number m of binary strings of fixed length n. Unless stated otherwise, no specific assumptions about selection, crossover, or mutation will be made.

Let us briefly reconsider the example in 2.2.1. We saw that the transition from the first to the second generation is given as follows:

    Gen. #1      f(x)          Gen. #2      f(x)
    0 1 1 0 1    169    =⇒     0 1 1 0 0    144
    1 1 0 0 0    576           1 1 0 0 1    625
    0 1 0 0 0     64           1 1 0 1 1    729
    1 0 0 1 1    361           1 0 0 0 0    256
It is easy to see that it is advantageous to have a 1 in the first position. In fact, the number of strings having this property increased from 2 in the first to 3 in the second generation. The question arises whether this is a coincidence or a clue to the basic principle why GAs work. The answer will be that the latter is the case. In order to investigate these aspects formally, let us make the following definition.

3.1 Definition. A string H = (h1, . . . , hn) over the alphabet {0, 1, ∗} is called a (binary) schema of length n. An hi ≠ ∗ is called a specification of H; an hi = ∗ is called a wildcard.

It is not difficult to see that schemata can be considered as specific subsets of {0, 1}^n if we consider the following function which maps a schema to its associated subset:

    i : {0, 1, ∗}^n −→ P({0, 1}^n)
        H −→ {S | ∀ 1 ≤ i ≤ n : (hi ≠ ∗) ⇒ (hi = si)}
Figure 3.1: Hypercubes of dimensions 1–4.

If we interpret binary strings of length n as hypercubes of dimension n (cf. Figure 3.1), schemata can be interpreted as hyperplanes in these hypercubes (see Figure 3.2 for an example with n = 3). Before turning to the first important result, let us make some fundamental definitions concerning schemata.

3.2 Definition.

1. A string S = (s1, . . . , sn) over the alphabet {0, 1} fulfills the schema H = (h1, . . . , hn) if and only if it matches H in all non-wildcard positions:

       ∀ i ∈ {j | hj ≠ ∗} : si = hi

   According to the discussion above, we write S ∈ H.

2. The number of specifications of a schema H is called its order and denoted as

       O(H) = |{i ∈ {1, . . . , n} | hi ≠ ∗}|.

3. The distance between the first and the last specification,

       δ(H) = max{i | hi ≠ ∗} − min{i | hi ≠ ∗},

   is called the defining length of the schema H.
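Definition 3.2 translates directly into code; a minimal sketch (the function names are ours, not from the text, and schemata are written as strings over "01*"):

```python
def fulfills(s, h):
    """True iff string s matches schema h in all non-wildcard positions."""
    return all(hi == "*" or hi == si for si, hi in zip(s, h))

def order(h):
    """O(H): number of specifications (non-wildcard positions)."""
    return sum(hi != "*" for hi in h)

def defining_length(h):
    """delta(H): distance between first and last specification."""
    spec = [i for i, hi in enumerate(h) if hi != "*"]
    return spec[-1] - spec[0] if spec else 0

print(fulfills("0110", "01*0"), order("01*0"), defining_length("01*0"))
```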
Figure 3.2: A hyperplane interpretation of schemata for n = 3.
3.1  The Schema Theorem

In this section, we will formulate and prove the fundamental result on the behavior of genetic algorithms, the so-called Schema Theorem. Although it is completely incomparable with convergence results for conventional optimization methods, it still provides valuable insight into the intrinsic principles of GAs.

Assume in the following that we have a genetic algorithm of type 2.1 with proportional selection and an arbitrary but fixed fitness function f. Let us fix the following notations:

1. The number of individuals which fulfill H at time step t is denoted as

       r_{H,t} = |B_t ∩ H|.

2. The expression f̄(t) refers to the observed average fitness at time t:

       f̄(t) = (1/m) · Σ_{i=1..m} f(b_{i,t})

3. The term f̄(H, t) stands for the observed average fitness of schema H at time step t:

       f̄(H, t) = (1/r_{H,t}) · Σ_{i ∈ {j | b_{j,t} ∈ H}} f(b_{i,t})
3.3 Theorem (Schema Theorem, Holland 1975). Assuming we consider a genetic algorithm of type 2.5, the following inequality holds for every schema H:

    E[r_{H,t+1}] ≥ r_{H,t} · (f̄(H, t)/f̄(t)) · (1 − pC · δ(H)/(n − 1)) · (1 − pM)^O(H)        (3.1)

Proof. The probability that we select an individual fulfilling H is (compare with Eq. (2.1))

    Σ_{i ∈ {j | b_{j,t} ∈ H}} f(b_{i,t}) / Σ_{i=1..m} f(b_{i,t}).        (3.2)

This probability does not change throughout the execution of the selection loop. Moreover, each of the m individuals is selected completely independently of the others. Hence, the number of selected individuals which fulfill H is binomially distributed with sample size m and the probability (3.2). We obtain, therefore, that the expected number of selected individuals fulfilling H is

    m · Σ_{i ∈ {j | b_{j,t} ∈ H}} f(b_{i,t}) / Σ_{i=1..m} f(b_{i,t})
      = r_{H,t} · (Σ_{i ∈ {j | b_{j,t} ∈ H}} f(b_{i,t}) / r_{H,t}) · (m / Σ_{i=1..m} f(b_{i,t}))
      = r_{H,t} · f̄(H, t)/f̄(t).
If two individuals which both fulfill H are crossed, the two offsprings again fulfill H. The number of strings fulfilling H can only decrease if a string which fulfills H is crossed with a string which does not fulfill H, and, obviously, only in the case that the cross site is chosen somewhere in between the specifications of H. The probability that the cross site is chosen within the defining length of H is

    δ(H)/(n − 1).

Hence the survival probability pS of H, i.e. the probability that a string fulfilling H produces an offspring also fulfilling H, can be estimated as follows (crossover is only done with probability pC):

    pS ≥ 1 − pC · δ(H)/(n − 1)
Selection and crossover are carried out independently, so we may compute the expected number of strings fulfilling H after crossover simply as

    (f̄(H, t)/f̄(t)) · r_{H,t} · pS ≥ (f̄(H, t)/f̄(t)) · r_{H,t} · (1 − pC · δ(H)/(n − 1)).
After crossover, the number of strings fulfilling H can only decrease if a string fulfilling H is altered by mutation at a specification of H. The probability that all specifications of H remain untouched by mutation is obviously (1 − pM)^O(H). Applying the same argument as above, Equation (3.1) follows.

The arguments in the proof of the Schema Theorem can be applied analogously to many other crossover and mutation operations.

3.4 Corollary. For a genetic algorithm of type 2.1 with roulette wheel selection, the inequality

    E[r_{H,t+1}] ≥ (f̄(H, t)/f̄(t)) · r_{H,t} · PC(H) · PM(H)        (3.3)

holds
for any schema H, where PC(H) is a constant only depending on the schema H and the crossover method, and PM(H) is a constant which solely depends on H and the involved mutation operator. For the variants discussed in 2.1.2 and 2.1.3, we can give the following estimates:

    PC(H) = 1 − pC · δ(H)/(n − 1)           one-point crossing over
    PC(H) = 1 − pC · (1 − (1/2)^O(H))       uniform crossing over
    PC(H) = 1 − pC                          any other crossing over method

    PM(H) = (1 − pM)^O(H)                   bitwise mutation
    PM(H) = 1 − pM · O(H)/n                 inversion of a single bit
    PM(H) = 1 − pM                          bitwise inversion
    PM(H) = 1 − pM · |H|/2^n                random selection
Even the inattentive reader must have observed that the Schema Theorem is somehow different from convergence results for conventional optimization methods. It seems that this result raises more questions than it is ever able to answer. At least one insight is more or less obvious: schemata with above-average fitness and short defining length (let us put aside the generalizations made in Corollary 3.4 for our following studies) tend to produce more offsprings than others. For brevity, let us call such schemata building blocks. It will become clear in a moment why this term is appropriate.

If we assume that the quotient f̄(H, t)/f̄(t) is approximately stationary, i.e. independent of time and the actual generations, we immediately see that the number of strings which belong to above-average schemata with short defining lengths grows exponentially (like a geometric sequence).

This discovery poses the question whether it is a wise strategy to let above-average schemata receive an exponentially increasing number of trials and, if the answer is yes, why this is the case. In 3.1.1, we will try to shed more light on this problem.

There is one other fundamental question we have not yet touched at all: undoubtedly, GAs operate on binary strings, but not on schemata. The Schema Theorem, more or less, provides an observation of all schemata, which all grow and decay according to their observed average fitness values in parallel. What is actually the interpretation of this behavior and why is this a good thing to do? Subsection 3.1.2 is devoted to this topic.

Finally, one might ask where the crucial role of schemata with above-average fitness and short defining length comes from and what the influence of the fitness function and the coding scheme is. We will attack these problems in 3.2.
3.1.1  The Optimal Allocation of Trials

The Schema Theorem has provided the insight that building blocks receive an exponentially increasing number of trials in future generations. The question remains, however, why this could be a good strategy. This leads to an important and well-analyzed problem from statistical decision theory: the two-armed bandit problem and its generalization, the k-armed bandit problem. Although this seems like a detour from our main concern, we shall soon understand the connection to genetic algorithms.

Suppose we have a gambling machine with two slots for coins and two arms. The gambler can deposit the coin either into the left or the right slot. After pulling the corresponding arm, either a reward is paid out or the coin is lost. For mathematical simplicity, we just work with outcomes, i.e. the difference between the reward (which can be zero) and the value of the coin.
Let us assume that the left arm produces an outcome with mean value μ1 and variance σ1², while the right arm produces an outcome with mean value μ2 and variance σ2². Without loss of generality, although the gambler does not know this, assume that μ1 ≥ μ2.

The question arises which arm should be played. Since we do not know beforehand which arm is associated with the higher outcome, we are faced with an interesting dilemma. Not only must we make a sequence of decisions about which arm to play, we have to collect, at the same time, information about which is the better arm. This trade-off between exploration of knowledge and its exploitation is the key issue in this problem and, as it turns out later, in genetic algorithms, too.

A simple approach to this problem is to separate exploration from exploitation. More specifically, we could perform a single experiment at the beginning and thereafter make an irreversible decision that depends on the results of the experiment. Suppose we have N coins. If we first allocate an equal number n (where 2n ≤ N) of trials to both arms, we could allocate the remaining N − 2n trials to the observed better arm. Assuming we know all involved parameters [13], the expected loss is given as

    L(N, n) = (μ1 − μ2) · ((N − n) · q(n) + n · (1 − q(n)))

where q(n) is the probability that the worse arm is the observed best arm after the 2n experimental trials. The underlying idea is obvious: in case we observe that the worse arm is the better one, which happens with probability q(n), the total number of trials allocated to the wrong arm is N − n; the loss is, therefore, (μ1 − μ2) · (N − n). In the reverse case, in which we correctly observe that the best arm is the best, which happens with probability 1 − q(n), we only lose what we missed by playing the worse arm n times, i.e. (μ1 − μ2) · n. Taking the central limit theorem into account, we can approximate q(n) with the tail of a normal distribution:

    q(n) ≈ (1/√(2π)) · e^(−c²/2) / c,   where c = (μ1 − μ2)/√(σ1² + σ2²) · √n.
Now we have to specify a reasonable experiment size n. Obviously, if we choose n = 1, the obtained information is potentially unreliable. If we choose n = N/2, however, there are no trials left to make use of the information gained through the experimental phase. What we see is again the trade-off between exploitation with almost no exploration (n = 1) and exploration without exploitation (n = N/2). It does not take a Nobel Prize winner to see that the optimal way is somewhere in the middle. Holland [16] has studied this problem in great detail. He came to the conclusion that the optimal strategy is given by the following equation:

    n* ≈ b² · ln( N² / (8πb⁴ ln N²) ),   where b = σ1/(μ1 − μ2).

Making a few transformations, we obtain that

    N − n* ≈ √(8πb⁴ ln N²) · e^(n*/(2b²)),

i.e. the optimal strategy is to allocate slightly more than an exponentially increasing number of trials to the observed best arm. Although no gambler is able to apply this strategy in practice, because it requires knowledge of the mean values μ1 and μ2, we still have found an important bound of performance which a decision strategy should try to approach.

A genetic algorithm, although the direct connection is not yet fully clear, actually comes close to this ideal, giving at least an exponentially increasing number of trials to the observed best building blocks. However, one may still wonder how the two-armed bandit problem and GAs are related. Let us consider an arbitrary string position. Then there are two schemata of order one which have their only specification in this position. According to the Schema Theorem, the GA implicitly decides between these two schemata, where only incomplete data are available (observed average fitness values). In this sense, a GA solves a lot of two-armed bandit problems in parallel.

The Schema Theorem, however, is not restricted to schemata of order 1. Looking at competing schemata (different schemata which are specified in the same positions), we observe that a GA is solving an enormous number of k-armed bandit problems in parallel. The k-armed bandit problem, although much more complicated, is solved in an analogous way [13, 16]: the observed better alternatives should receive an exponentially increasing number of trials. This is exactly what a genetic algorithm does!
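The quantity q(n) is easy to check by simulation. The sketch below estimates, by Monte Carlo, the probability that the worse arm looks better after 2n experimental trials and compares it with the normal-tail approximation; the arm parameters are our own illustrative choices:

```python
import math
import random

random.seed(1)

MU1, MU2, SIGMA = 1.0, 0.5, 1.0   # illustrative parameters, mu1 > mu2
n = 20                            # experimental trials per arm
RUNS = 20000

# Monte Carlo: fraction of experiments in which the worse arm wins
wrong = 0
for _ in range(RUNS):
    s1 = sum(random.gauss(MU1, SIGMA) for _ in range(n))
    s2 = sum(random.gauss(MU2, SIGMA) for _ in range(n))
    if s2 > s1:
        wrong += 1
q_mc = wrong / RUNS

# Normal-tail approximation: q(n) ~ exp(-c^2/2) / (c * sqrt(2*pi))
c = (MU1 - MU2) / math.sqrt(SIGMA ** 2 + SIGMA ** 2) * math.sqrt(n)
q_approx = math.exp(-c * c / 2) / (c * math.sqrt(2 * math.pi))

print(round(q_mc, 4), round(q_approx, 4))
```

Both numbers are small and of the same order, which is all the loss estimate L(N, n) needs.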
3.1.2  Implicit Parallelism

So far we have discovered two distinct, seemingly conflicting views of genetic algorithms:

1. The algorithmic view that GAs operate on strings.

2. The schema-based interpretation.
So, we may ask what a GA really processes, strings or schemata? The answer is surprising: both. Nowadays, the common interpretation is that a GA processes an enormous amount of schemata implicitly. This is accomplished by exploiting the currently available, incomplete information about these schemata continuously, while trying to explore more information about them and other, possibly better schemata. This remarkable property is commonly called the implicit parallelism of genetic algorithms.

A simple GA as presented in Chapter 2 processes only m structures in one time step, without any memory or bookkeeping about the previous generations. We will now try to get a feeling for how many schemata a GA actually processes.

Obviously, there are 3^n schemata of length n. A single binary string fulfills (n choose 1) schemata of order 1, (n choose 2) schemata of order 2, and, in general, (n choose k) schemata of order k. Hence, counting also the trivial all-wildcard schema of order 0, a string fulfills

    Σ_{k=0..n} (n choose k) = 2^n

schemata. Thus, for any generation, we obtain that there are between 2^n and m · 2^n schemata which have at least one representative. But how many schemata are actually processed? Holland [16] has given an estimation of the quantity of schemata that are taken over to the next generation. Although the result seems somewhat clumsy, it still provides important information about the large quantity of schemata which are inherently processed in parallel while, in fact, only a relatively small quantity of strings is considered.

3.5 Theorem. Consider a randomly generated start population of a simple GA of type 2.5 and let ε ∈ (0, 1) be a fixed error bound. Schemata of length

    l_s < ε · (n − 1) + 1

have a probability of at least 1 − ε to survive one-point crossover (compare with the proof of the Schema Theorem). If the population size is chosen as m = 2^(l_s/2), the number of schemata which survive for the next generation is of order O(m³).
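The count of 2^n can be verified directly for small n by enumerating all 3^n schemata and testing which ones a given string fulfills (a brute-force sketch):

```python
from itertools import product

def fulfills(s, h):
    """True iff string s matches schema h in all non-wildcard positions."""
    return all(hi == "*" or hi == si for si, hi in zip(s, h))

n = 4
s = "1011"
schemata = ["".join(h) for h in product("01*", repeat=n)]  # all 3^n schemata
fulfilled = [h for h in schemata if fulfills(s, h)]
print(len(schemata), len(fulfilled))
```

Each of the n positions of a fulfilled schema is either the corresponding bit of s or a wildcard, hence the count 2^n.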
3.2  Building Blocks and the Coding Problem

We have already introduced the term "building block" for a schema with high average fitness and short defining length (implying small order). Now it is time to explain why this notion is appropriate. We have seen in the Schema Theorem and 3.1.1 that building blocks receive an exponentially increasing number of trials. The considerations in 3.1.2 have demonstrated that a lot of schemata (including building blocks) are evaluated implicitly and in parallel. What we still miss is the link to performance, i.e. convergence. Unfortunately, there is no complete theory which gives a clear answer, just a hypothesis.

3.6 Building Block Hypothesis. A genetic algorithm creates stepwise better solutions by recombining, crossing, and mutating short, low-order, high-fitness schemata.

Goldberg [13] has found a good comparison for pointing out the main assertion of this hypothesis:

    Just as a child creates magnificent fortresses through the arrangement of simple blocks of wood, so does a genetic algorithm seek near optimal performance through the juxtaposition of short, low-order, high-performance schemata, or building blocks.

This seems a reasonable assumption and fits well to the Schema Theorem. The question is now if and when it holds. We first consider an affine linear fitness function

    f(s) = a + Σ_{i=1..n} c_i · s[i],
i.e. the fitness is computed as a linear combination of all genes. It is easy to see that the optimal value can be determined for every gene independently (depending only on the sign of the scaling factor c_i).

Conversely, let us consider a needle-in-a-haystack problem as the other extreme:

    f(x) = 1  if x = x0
    f(x) = 0  otherwise

Obviously, there is a single string x0 which is the optimum, while all other strings have equal fitness values. In this case, certain values at single positions (schemata) do not provide any information for guiding an optimization algorithm toward the global optimum.

In the linear case, the building block hypothesis seems justified. For the second function, however, it cannot be true, since there is absolutely no information available which could guide a GA to the global solution through partial, suboptimal solutions. In other words, the more the positions can be judged independently, the easier it is for a GA. On the other hand, the more positions are coupled, the more difficult it is for a GA (and for any other optimization method).

Biologists have come up with a special term for this kind of nonlinearity: epistasis. Empirical studies have shown that GAs are appropriate for problems with medium epistasis. While almost linear problems (i.e. with low epistasis) can be solved much more efficiently with conventional methods, highly epistatic problems cannot be solved efficiently at all [15].

We will now come to a very important question which is strongly related to epistasis: do good parents always produce children of comparable or even better fitness (the building block hypothesis implicitly relies on this)? In natural evolution, this is almost always true. For genetic algorithms, this is not so easy to guarantee. The disillusioning fact is that the user has to take care of an appropriate coding in order to make this fundamental property hold.

In order to get a feeling for optimization tasks which could fool a GA, we will now try to construct a very simple misleading example. Apparently, for n = 1, no problems can occur; the two-bit problem (n = 2) is the first possibility. Without loss of generality, assume that 11 is the global maximum. Next we introduce the element of deception necessary to make this a tough problem for a simple GA. To do this, we want a problem where one or both of the suboptimal order-1 schemata are better than the optimal order-1 schemata. Mathematically, we want one or both of the following conditions to be fulfilled:

    f(0*) > f(1*),        (3.4)
    f(*0) > f(*1),        (3.5)

i.e.

    (f(00) + f(01))/2 > (f(10) + f(11))/2,        (3.6)
    (f(00) + f(10))/2 > (f(01) + f(11))/2.        (3.7)

Both expressions cannot hold simultaneously, since this would contradict the maximality of 11. Without any loss of generality, we choose the first condition for our further considerations. In order to put the problem into closer perspective, we normalize all fitness values with respect to the complement of the global optimum:

    r = f(11)/f(00),    c = f(01)/f(00),    c′ = f(10)/f(00)
Figure 3.3: Minimal deceptive problems of type I (left) and type II (right).

The maximality condition implies

    r > c,    r > 1,    r > c′.

The deception conditions (3.4) and (3.6), respectively, read as follows:

    r < 1 + c − c′.

From these conditions, we can conclude the following facts:

    c′ < 1,    c′ < c.

We see that there are two possible types of minimal deceptive two-bit problems based on (3.4):

    Type I:  f(01) > f(00)   (c > 1)
    Type II: f(01) ≤ f(00)   (c ≤ 1)

Figure 3.3 shows sketches of these two fundamental types of deceptive problems. It is easy to see that both fitness functions are nonlinear. In this sense, epistasis is again the bad property behind the deception in these problems.
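These conditions are easy to check mechanically for any concrete two-bit fitness table; the values below are our own illustration of a type-II deceptive problem, not taken from the text:

```python
def is_deceptive(f):
    """Check condition (3.4)/(3.6): the suboptimal order-1 schema 0* beats 1*,
    for a fitness table f with keys '00', '01', '10', '11' (maximum at '11')."""
    assert max(f, key=f.get) == "11"
    return (f["00"] + f["01"]) / 2 > (f["10"] + f["11"]) / 2

# Illustrative type-II example: f(01) <= f(00), yet 0* looks better than 1*
f = {"00": 0.9, "01": 0.8, "10": 0.1, "11": 1.0}
print(is_deceptive(f))
```

Here r = 1/0.9 ≈ 1.11 < 1 + c − c′ ≈ 1.78, so the deception condition of the text holds.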
3.2.1  Example: The Traveling Salesman Problem

We have already mentioned that it is essential for a genetic algorithm that good individuals produce comparably good or even better offsprings. We will now study a nontrivial example which is well-known in logistics: the traveling salesman problem (TSP). Assume we are given a finite set of vertices/cities {v1, . . . , vN}. For every pair of cities (vi, vj), the distance D_{i,j} is known (i.e. we have a symmetric N × N distance matrix). What we want to find is a permutation (p1, . . . , pN) such that the total way, i.e. the sum of distances, is minimal:

    f(p) = Σ_{i=1..N−1} D_{p_i, p_{i+1}} + D_{p_N, p_1}
This problem appears in route planning, VLSI design, etc.

For solving the TSP with a genetic algorithm, we need a coding, a crossover method, and a mutation method. All three components should work together such that the building block hypothesis is satisfiable. First of all, it seems promising to encode a permutation as a string of integer numbers where entry no. i refers to the i-th city which is visited. Since every number between 1 and N must occur exactly once (otherwise we do not have a complete tour), the conventional one-point crossover method is inappropriate, as are all the other methods we have considered so far. If we put aside mutation for a moment, the key problem remains how to define an appropriate crossover operation for the TSP.

Partially Mapped Crossover

Partially mapped crossover (PMX) aims at keeping as many positions from the parents as possible. To achieve this goal, a substring is swapped as in two-point crossover and the values are kept in all other non-conflicting positions. The conflicting positions are replaced by the values which were swapped to the other offspring. An example:

    p1 = (1 2 3 4 5 6 7 8 9)
    p2 = (4 5 2 1 8 7 6 9 3)

Assume that positions 4–7 are selected for swapping. Then the two offsprings are given as follows if we omit the conflicting positions:

    o1 = (* 2 3 | 1 8 7 6 | * 9)
    o2 = (* * 2 | 4 5 6 7 | 9 3)

Now we take the conflicting positions and fill in what was swapped to the other offspring. For instance, 1 and 4 were swapped. Therefore, we have to replace the 1 in the first position of o1 by 4, and so on:

    o1 = (4 2 3 1 8 7 6 5 9)
    o2 = (1 8 2 4 5 6 7 9 3)
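The PMX step above can be sketched in code (0-based indices; the cut points below correspond to positions 4–7 of the example):

```python
def pmx(p1, p2, lo, hi):
    """Partially mapped crossover: swap p1[lo:hi] and p2[lo:hi], then repair
    conflicts outside the segment via the mapping induced by the swap."""
    def child(receiver, segment_in, segment_out):
        # mapping: value that arrived in the segment -> value that left it
        mapping = dict(zip(segment_in, segment_out))
        c = list(receiver)
        c[lo:hi] = segment_in
        for i in list(range(lo)) + list(range(hi, len(c))):
            while c[i] in mapping:        # follow chains of conflicts
                c[i] = mapping[c[i]]
        return c

    s1, s2 = p1[lo:hi], p2[lo:hi]
    return child(p1, s2, s1), child(p2, s1, s2)

p1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
p2 = [4, 5, 2, 1, 8, 7, 6, 9, 3]
o1, o2 = pmx(p1, p2, 3, 7)       # positions 4-7 in the text's 1-based counting
print(o1)
print(o2)
```

The mapping is injective, so the repair loop always terminates, and both offsprings are valid permutations.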
Order Crossover

Order crossover (OX) relies on the idea that the order of cities is more important than their absolute positions in the strings. Like PMX, it swaps two aligned substrings. The computation of the remaining substrings of the offsprings, however, is done in a different way. In order to illustrate this rather simple idea, let us consider the same example (p1, p2) as above. Simply swapping the two substrings and omitting all other positions, we obtain the following:

    o1 = (* * * | 1 8 7 6 | * *)
    o2 = (* * * | 4 5 6 7 | * *)

For computing the open positions of o2, let us write down the values of p2, starting from the position after the second crossover site:

    9 3 4 5 2 1 8 7 6

If we omit all those values which are already in the offspring after the swapping operation (4, 5, 6, and 7), we end up with the following shortened list:

    9 3 2 1 8

Now we insert this list into o2, again starting after the second crossover site, and we obtain

    o2 = (2 1 8 4 5 6 7 9 3).

Applying the same technique to o1 (with the values taken from p1) produces the following result:

    o1 = (3 4 5 1 8 7 6 9 2).

Cycle Crossover

PMX and OX have in common that they usually introduce alleles outside the crossover sites which were not present at these positions in either parent. For instance, the 3 in the first position of o1 in the OX example above appears in the first position of neither p1 nor p2. Cycle crossover (CX) tries to overcome this problem: the goal is to guarantee that every string position in any offspring comes from one of the two parents. We consider the following example:

    p1 = (1 2 3 4 5 6 7 8 9)
    p2 = (4 1 2 8 7 6 9 3 5)
We start from the first position of o1:

o1 = (1 * * * * * * * *)
o2 = (* * * * * * * * *)
Then o2 may only have a 4 in the first position, because we do not want new values to be introduced there:

o1 = (1 * * * * * * * *)
o2 = (4 * * * * * * * *)

Since the 4 is already fixed for o2 now, we have to keep it in the same position for o1 in order to guarantee that no new position for the 4 is introduced. For the same reason, we automatically have to keep the 8 in the fourth position of o2:

o1 = (1 * * 4 * * * * *)
o2 = (4 * * 8 * * * * *)

This process must be repeated until we reach a value which has been considered before, i.e. until we have completed a cycle:

o1 = (1 2 3 4 * * * 8 *)
o2 = (4 1 2 8 * * * 3 *)

For the second cycle, we can start with a value from p2 and insert it into o1:

o1 = (1 2 3 4 7 * * 8 *)
o2 = (4 1 2 8 5 * * 3 *)

After the same tedious computations, we end up with the following:

o1 = (1 2 3 4 7 * 9 8 5)
o2 = (4 1 2 8 5 * 7 3 9)

The last cycle is a trivial one (6-6) and the final offspring are given as follows:

o1 = (1 2 3 4 7 6 9 8 5)
o2 = (4 1 2 8 5 6 7 3 9)

In case the two parents form one single cycle, no crossover can take place. It is worth mentioning that empirical studies have shown that OX gives 11% better results than PMX and 15% better results than CX. In general, the performance of all three methods is rather poor.
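Both OX and CX can be sketched in a few lines each (hypothetical names; the cut sites for OX are passed explicitly):

```python
def order_crossover(p1, p2, cut1, cut2):
    """OX: the aligned segments [cut1:cut2] are swapped; the remaining
    positions are filled with the other parent's cities in their cyclic
    order, starting after the second cut site."""
    n = len(p1)
    def make_child(a, b):
        child = [None] * n
        child[cut1:cut2] = a[cut1:cut2]                # kept segment
        kept = set(a[cut1:cut2])
        order = [b[(cut2 + i) % n] for i in range(n)]  # b, read cyclically
        fill = [c for c in order if c not in kept]     # drop already-used cities
        for i, c in enumerate(fill):
            child[(cut2 + i) % n] = c                  # insert after 2nd cut site
        return child
    # the segments are swapped, so offspring 1 inherits parent 2's segment
    return make_child(p2, p1), make_child(p1, p2)

def cycle_crossover(p1, p2):
    """CX: positions are decomposed into cycles which are copied from the
    parents alternately, so every position comes from one of the parents."""
    n = len(p1)
    pos_in_p1 = {v: i for i, v in enumerate(p1)}
    o1, o2 = [None] * n, [None] * n
    from_p1 = True                       # first cycle: o1 copies p1
    for start in range(n):
        if o1[start] is not None:        # position covered by an earlier cycle
            continue
        cycle, i = [], start
        while i not in cycle:            # follow the cycle of positions
            cycle.append(i)
            i = pos_in_p1[p2[i]]
        for i in cycle:
            o1[i], o2[i] = (p1[i], p2[i]) if from_p1 else (p2[i], p1[i])
        from_p1 = not from_p1            # alternate for the next cycle
    return o1, o2

# The OX example from the text:
o1, o2 = order_crossover([1, 2, 3, 4, 5, 6, 7, 8, 9], [4, 5, 2, 1, 8, 7, 6, 9, 3], 3, 7)
# o1 = [3, 4, 5, 1, 8, 7, 6, 9, 2], o2 = [2, 1, 8, 4, 5, 6, 7, 9, 3]
# The CX example from the text:
c1, c2 = cycle_crossover([1, 2, 3, 4, 5, 6, 7, 8, 9], [4, 1, 2, 8, 7, 6, 9, 3, 5])
# c1 = [1, 2, 3, 4, 7, 6, 9, 8, 5], c2 = [4, 1, 2, 8, 5, 6, 7, 3, 9]
```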
A Coding with Reference List
Now we discuss an approach which modifies the coding scheme such that all conventional crossover methods are applicable. It works as follows: a reference list is initialized with {1, . . . , N}. Starting from the first position, we take the index of the actual element in the list, which is then removed from the list. An example:

p = (1 2 4 3 8 5 9 6 7)

The first element is 1 and its position in the reference list {1, . . . , 9} is 1. Hence,

p̃ = (1 * * * * * * * *).

The next entry is 2 and its position in the remaining reference list {2, . . . , 9} is 1, and we can go further:

p̃ = (1 1 * * * * * * *).

The third allele is 4 and its position in the remaining reference list {3, . . . , 9} is 2, and we obtain:

p̃ = (1 1 2 * * * * * *).

It is left to the reader as an exercise to continue with this example. He or she will come to the conclusion that

p̃ = (1 1 2 1 4 1 3 1 1).

The attentive reader might have guessed that a string in this coding is a valid permutation if and only if the following holds for all 1 ≤ i ≤ N:

1 ≤ p̃i ≤ N − i + 1.

Since this criterion applies to single string positions only, completely independently of other positions, it can never be violated by any crossover method which we have discussed for binary strings. This is, without any doubt, a good property. The next example, however, drastically shows that one-point crossover produces more or less random values behind the crossover site:

p̃1 = (1 1 2 1 | 4 1 3 1 1)    p1 = (1 2 4 3 8 5 9 6 7)
p̃2 = (5 1 5 5 | 5 4 3 2 1)    p2 = (5 1 7 8 9 6 4 3 2)
õ1 = (1 1 2 1 | 5 4 3 2 1)    o1 = (1 2 4 3 9 8 7 6 5)
õ2 = (5 1 5 5 | 4 1 3 1 1)    o2 = (5 1 7 8 6 2 9 3 4)
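The transformation to and from the reference-list coding can be sketched as follows (hypothetical names; indices are 1-based as in the text):

```python
def encode(perm):
    """Replace each city by its 1-based position in the shrinking list
    of not-yet-visited cities."""
    ref = sorted(perm)
    code = []
    for city in perm:
        code.append(ref.index(city) + 1)   # position in the reference list
        ref.remove(city)                   # city is used up
    return code

def decode(code):
    """Inverse transformation: pop the indexed entry from the reference list."""
    ref = list(range(1, len(code) + 1))
    return [ref.pop(i - 1) for i in code]

print(encode([1, 2, 4, 3, 8, 5, 9, 6, 7]))   # [1, 1, 2, 1, 4, 1, 3, 1, 1]
```

Note that every entry of the code satisfies the validity criterion 1 ≤ p̃i ≤ N − i + 1 by construction.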
Edge Recombination
Absolute string positions do not have any meaning at all: we may start a given round trip at a different city and still observe the same total length. The order, as in OX, already has a greater importance. However, it is not the order itself that makes a trip efficient, it is the set of edges between cities, where it is obviously not important in which direction we pass such an edge. In this sense, the real building blocks of the TSP are hidden in the connections between cities. A method called edge recombination (ER) rests upon this observation. The basic idea is to cache information about all edges and to compute an offspring from this edge list. We will study the basic principle with the help of a simple example:

p1 = (1 2 3 4 5 6 7 8 9)
p2 = (4 1 2 8 7 6 9 3 5)

The first step is to compute, for every city, the list of cities to which it is connected in at least one of the two parents. What we obtain is a list of 2-4 neighbors for each city:

1 → 2, 4, 9
2 → 1, 3, 8
3 → 2, 4, 5, 9
4 → 1, 3, 5
5 → 3, 4, 6
6 → 5, 7, 9
7 → 6, 8
8 → 2, 7, 9
9 → 1, 3, 6, 8
We start from the city with the lowest number of neighbors (7 in this example), put it into the offspring, and erase it from all adjacency lists. From 7, we have two possibilities to move next: 6 and 8. We always take the one with the smaller number of remaining neighbors. If these numbers are equal, random selection takes place. This procedure is repeated until the permutation is complete or a conflict occurs (no edges left, but permutation not yet complete). Empirical studies have shown that the probability of not running into a conflict is about 98%. This probability is high enough to have a good chance when trying a second time. Continuing the example, the following offspring could be obtained:

o = (7 6 5 4 1 9 8 2 3)
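A sketch of the greedy construction (hypothetical names; ties are broken at random, and None is returned when a conflict occurs):

```python
import random

def edge_table(p1, p2):
    """For every city, the set of its neighbors in either parent tour."""
    edges = {c: set() for c in p1}
    n = len(p1)
    for tour in (p1, p2):
        for i, c in enumerate(tour):
            edges[c].add(tour[(i - 1) % n])
            edges[c].add(tour[(i + 1) % n])
    return edges

def edge_recombination(p1, p2, rng=None):
    """Always move to the unvisited neighbor with the fewest remaining
    neighbors; return None if no edge is left to follow (a conflict)."""
    rng = rng or random.Random()
    edges = edge_table(p1, p2)
    current = min(edges, key=lambda c: (len(edges[c]), c))  # fewest neighbors
    offspring = [current]
    while len(offspring) < len(p1):
        for nbrs in edges.values():
            nbrs.discard(current)            # erase the visited city everywhere
        candidates = edges.pop(current)
        if not candidates:
            return None                      # conflict
        fewest = min(len(edges[c]) for c in candidates)
        current = rng.choice(sorted(c for c in candidates
                                    if len(edges[c]) == fewest))
        offspring.append(current)
    return offspring
```

On the example above, the construction always starts with city 7; whether it completes without a conflict depends on the random tie-breaking.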
There are a few variants for improving the convergence of a GA with ER. First of all, it seems reasonable to mark all edges which occur in both parents and to favor them in the selection of the next neighbor. Moreover, it could be helpful to incorporate information about the lengths of single edges.
3.3 Concluding Remarks
In this chapter, we have collected several important results which provide valuable insight into the intrinsic principles of genetic algorithms. These insights were not given as hard mathematical results, but only as a loose collection of interpretations. In order to bring some structure into this material, let us summarize our achievements:

1. Short, low-order schemata with above-average fitness (building blocks) receive an exponentially increasing number of trials. With the help of a detour to the two-armed bandit problem, we have seen that this is a near-optimal strategy.

2. Although a genetic algorithm only processes m structures at a time, it implicitly accumulates and exploits information about an enormous number of schemata in parallel.

3. We were tempted to believe that a genetic algorithm produces solutions by the juxtaposition of small efficient parts, the building blocks. Our detailed considerations have shown, however, that this good property can only hold if the coding is chosen properly. One sophisticated example, the TSP, has shown how difficult this can be for nontrivial problems.
Chapter 4

Variants
Ich möchte aber behaupten, daß die Experimentiermethode der Evolution gleichfalls einer Evolution unterliegt. Es ist nämlich nicht nur die momentane Lebensleistung eines Individuums für das Überleben der Art wichtig; nach mehreren Generationen wird auch die bessere Vererbungs-Strategie, die eine schnellere Umweltanpassung zustandebringt, ausgelesen und weiterentwickelt.

Ingo Rechenberg

(I would claim, however, that the experimental method of evolution is itself subject to evolution. It is not only the momentary achievement of an individual that is important for the survival of the species; after several generations, the better inheritance strategy, which brings about faster adaptation to the environment, is also selected and developed further.)

As Rechenberg pointed out correctly [25], the mechanisms behind evolution themselves are subject to evolution. The diversity and the stage of development of nature as we see it today could never have been achieved with asexual reproduction alone. It is exactly the sophistication of genetic mechanisms which allowed faster and faster adaptation of genetic material. So far, we have only considered a very simple class of GAs. This chapter is intended to provide an overview of more sophisticated variants.
4.1 Messy Genetic Algorithms
In a "classical" genetic algorithm, the genes are encoded in a fixed order. The meaning of a single gene is determined by its position inside the string. We have seen in the previous chapter that a genetic algorithm is likely to converge well if the optimization task can be divided into several short building blocks. What, however, happens if the coding is chosen such that couplings occur between distant genes? Of course, one-point crossover tends to disadvantage long schemata (even if they have low order) over short ones.
Figure 4.1: A messy coding.
Figure 4.2: Positional preference: genes with index 1 and 6 occur twice; the first occurrences are used.

Messy genetic algorithms try to overcome this difficulty by using a variable-length, position-independent coding. The key idea is to append an index to each gene which allows its position to be identified [14, 15]. A gene, therefore, is no longer represented as a single allele value at a fixed position, but as a pair of an index and an allele. Figure 4.1 shows how this "messy" coding works for a string of length 6. Since the genes can be identified uniquely with the help of the index, genes may be swapped arbitrarily without changing the meaning of the string. With appropriate genetic operations, which also change the order of the pairs, the GA could possibly group coupled genes together automatically.

Due to the free arrangement of genes and the variable length of the encoding, we can, however, run into problems which do not occur in a simple GA. First of all, it can happen that there are two entries in a string which correspond to the same index but have conflicting alleles. The most obvious way to overcome this "overspecification" is positional preference: the first entry which refers to a gene is taken. Figure 4.2 shows an example.

The reader may have observed that the genes with indices 3 and 5 do not occur at all in the example in Figure 4.2. This problem of "underspecification" is more complicated and its solution is not as obvious as for overspecification. Of course, a lot of variants are reasonable. One approach could be to check all possible combinations and to take the best one (for k missing genes, there are 2^k combinations). With the objective to reduce this effort, Goldberg et al. [14] have suggested to use so-called competitive
Figure 4.3: The cut and splice operation. There are 12 possible ways to splice the four parts; this example shows five of them.

templates for finding specifications for the k missing genes. This is nothing other than applying a local hill climbing method with random initial values to the k missing genes.

While messy GAs usually work with the same mutation operator as simple GAs (every allele is altered with a low probability pM), the crossover operator is replaced by a more general cut and splice operator which also makes it possible to mate parents of different lengths. The basic idea is to choose cut sites for both parents independently and to splice the four fragments. Figure 4.3 shows an example.
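Positional preference and cut and splice are simple to state in code. The following sketch uses hypothetical names; resolving underspecified genes from a fixed template stands in for the competitive-template idea:

```python
import random

def express(chromosome, length, template):
    """Decode a messy chromosome, given as a list of (index, allele) pairs
    with indices in 1..length. Overspecification is resolved by positional
    preference; underspecified genes fall back to the template."""
    result = list(template)
    seen = set()
    for idx, allele in chromosome:
        if idx not in seen:              # positional preference: first entry wins
            result[idx - 1] = allele
            seen.add(idx)
    return result

def cut_and_splice(a, b, rng):
    """Choose a cut site in each parent independently, then splice two of
    the four fragments into each offspring."""
    ca, cb = rng.randrange(len(a) + 1), rng.randrange(len(b) + 1)
    return a[:ca] + b[cb:], b[:cb] + a[ca:]

# Genes 1 and 6 occur twice (first occurrences win); genes 3 and 5 are missing.
chrom = [(1, 1), (6, 0), (1, 0), (2, 1), (6, 1), (4, 0)]
print(express(chrom, 6, [0, 0, 0, 0, 0, 0]))   # [1, 1, 0, 0, 0, 0]
```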
4.2 Alternative Selection Schemes
Depending on the actual problem, other selection schemes than the roulette wheel can be useful:

Linear rank selection: In the beginning, the potentially good individuals sometimes fill the population too fast, which can lead to premature convergence to local maxima. On the other hand, refinement in the end phase can be slow since the individuals have similar fitness values. These problems can be overcome by taking the rank of the fitness values as the basis for selection instead of the values themselves.

Tournament selection: Closely related to the problems above, it can be better
not to use the fitness values themselves. In this scheme, a small group of individuals is sampled from the population, and the individual with the best fitness is chosen for reproduction. This selection scheme is also applicable when the fitness function is given in implicit form, i.e. when we only have a comparison relation which determines which of two given individuals is better.
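Both schemes can be sketched compactly (hypothetical names; the linear rank weights shown are simply proportional to rank, one of several common choices):

```python
import random

def rank_weights(fitnesses):
    """Linear rank selection: selection probability proportional to the rank
    of an individual's fitness, not to the fitness value itself."""
    order = sorted(range(len(fitnesses)), key=lambda i: fitnesses[i])
    ranks = [0] * len(fitnesses)
    for r, i in enumerate(order, start=1):
        ranks[i] = r                       # worst individual gets rank 1
    total = sum(ranks)
    return [r / total for r in ranks]

def tournament_select(population, better, k=3, rng=None):
    """Sample k individuals and return the best; only a comparison relation
    better(x, y) is required, no numeric fitness."""
    rng = rng or random.Random()
    group = rng.sample(population, k)
    best = group[0]
    for x in group[1:]:
        if better(x, best):
            best = x
    return best
```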
Moreover, there is one "plug-in" which is frequently used in conjunction with any of the three selection schemes we know so far: elitism. The idea is to ensure that the best individual observed so far does not die out, simply by copying it into the next generation without any random experiment. Elitism is widely used for speeding up the convergence of a GA. It should, however, be used with caution, because it can lead to premature convergence.
4.3 Adaptive Genetic Algorithms
Adaptive genetic algorithms are GAs whose parameters, such as the population size, the crossover probability, or the mutation probability, are varied while the GA is running (e.g. see [8]). A simple variant could be the following: the mutation rate is changed according to changes in the population; the longer the population does not improve, the higher the mutation rate is chosen. Vice versa, it is decreased again as soon as an improvement of the population occurs.
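One concrete instance of this idea can be sketched as follows (the scaling factor and the bounds are arbitrary assumptions, not values from the text):

```python
def adapt_mutation_rate(p_m, improved, factor=1.5, p_min=0.001, p_max=0.25):
    """Raise the mutation rate while the population stagnates, lower it
    again as soon as the best fitness improves; clamp to [p_min, p_max]."""
    p_m = p_m / factor if improved else p_m * factor
    return min(max(p_m, p_min), p_max)
```

Called once per generation with a flag indicating whether the population improved, this yields the behavior described above.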
4.4 Hybrid Genetic Algorithms
As they use the fitness function only in the selection step, genetic algorithms are blind optimizers which do not use any auxiliary information such as derivatives or other specific knowledge about the special structure of the objective function. If there is such knowledge, however, it is unwise and inefficient not to make use of it. Several investigations have shown that a lot of synergy lies in the combination of genetic algorithms and conventional methods. The basic idea is to divide the optimization task into two complementary parts: the coarse, global optimization is done by the GA, while local refinement is done by a conventional method (e.g. gradient-based, hill climbing, greedy algorithm, simulated annealing, etc.). A number of variants are reasonable:
1. The GA performs a coarse search first. After the GA is completed, local refinement is done.

2. The local method is integrated into the GA. For instance, every K generations, the population is doped with a locally optimal individual.

3. Both methods run in parallel: all individuals are continuously used as initial values for the local method. The locally optimized individuals are re-implanted into the current generation.
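Variant 2 from the list above can be sketched as follows (hypothetical names; the GA step, the local search method, and the doping interval K are all placeholders):

```python
def hybrid_step(population, fitness, ga_step, local_search, generation, K=10):
    """One generation of a hybrid GA: an ordinary GA step, plus, every K
    generations, doping the population with a locally refined individual."""
    population = ga_step(population)
    if generation % K == 0:
        best = max(population, key=fitness)                # pick the current best
        population[population.index(best)] = local_search(best)  # refine it
    return population
```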
4.5 Self-Organizing Genetic Algorithms
As already mentioned, the reproduction methods and the representations of the genetic material were adapted through the billions of years of evolution [25]. Many of these adaptations were able to increase the speed of adaptation of the individuals. We have seen several times that the choice of the coding method and the genetic operators is crucial for the convergence of a GA. Therefore, it is promising to encode not only the raw genetic information, but also some additional information, for example, parameters of the coding function or of the genetic operators. If this is done properly, the GA could find its own optimal way of representing and manipulating data automatically.
Chapter 5

Tuning of Fuzzy Systems Using Genetic Algorithms
There are two concepts within fuzzy logic which play a central role in its applications. The first is that of a linguistic variable, that is, a variable whose values are words or sentences in a natural or synthetic language. The other is that of a fuzzy if-then rule in which the antecedent and consequent are propositions containing linguistic variables. The essential function served by linguistic variables is that of granulation of variables and their dependencies. In effect, the use of linguistic variables and fuzzy if-then rules results—through granulation—in soft data compression which exploits the tolerance for imprecision and uncertainty. In this respect, fuzzy logic mimics the crucial ability of the human mind to summarize data and focus on decision-relevant information.

Lotfi A. Zadeh

Since it is not the main topic of this lecture, a detailed introduction to fuzzy systems is omitted here. We restrict ourselves to a few basic facts which are sufficient for understanding this chapter (the reader is referred to the literature for more information [20, 21, 27, 31]). The quotation above brilliantly expresses what the core of fuzzy systems is: linguistic if-then rules involving vague propositions (e.g. "large", "small", "old", "around zero", etc.). In this way, fuzzy systems allow reproducible automation of tasks for which no analytic model is known, but for which linguistic expert knowledge is available. Examples range from complicated chemical processes to power plant control, quality control, etc.
This sounds fine at first glance, but poses a few questions: how can such vague propositions be represented mathematically and how can we process them? The idea is simple but effective: such vague assertions are modeled by means of so-called fuzzy sets, i.e. sets which can have intermediate degrees of membership (the unit interval [0, 1] is usually taken as the domain of membership degrees). In this way, it is possible to model concepts like "tall men" which can never be represented in classical set theory without drawing ambiguous, counterintuitive boundaries.

To summarize, there are three essential components of fuzzy systems:

1. The rules, i.e. a verbal description of the relationships.

2. The fuzzy sets (membership functions), i.e. the semantics of the vague expressions used in the rules.

3. An inference machine, i.e. a mathematical methodology for processing a given input through the rule base.

Since this is not a major concern in this lecture, let us assume that a reasonable inference scheme is given. There are still two important components left which have to be specified in order to make a fuzzy system work: the rules and the fuzzy sets. In many cases, they can both be found simply by using common sense (some consider fuzzy systems to be nothing else than a mathematical model of common sense knowledge). In most problems, however, there is only an incomplete or inexact description of the automation task. Therefore, researchers soon began to investigate methods for finding or optimizing the parameters of fuzzy systems. So far, we can distinguish between the following three fundamental learning tasks:

1. The rules are given, but the fuzzy sets are completely unknown and must be found, or, what happens more often, they can only be estimated and need to be optimized. A typical example would be the following: the rules for driving a car are taught in every driving school, e.g. "for starting a car, let in the clutch gently and, simultaneously, step on the gas carefully", but the beginner must learn from practical experience what "letting in the clutch gently" actually means.

2. The semantic interpretation of the rules is known sufficiently well, but the relationships between input and output, i.e. the rules, are not known. A typical example is extracting certain risk factors from patient data. In this case, it is sufficiently known which blood pressures are
high and which are low, but the factors which really influence the risk of getting a certain disease are unknown.

3. Nothing is known; both fuzzy sets and rules must be acquired, for instance, from sample data.
5.1 Tuning of Fuzzy Sets
Let us start with the first learning task: how to find optimal configurations of fuzzy sets. In Chapter 2, we presented a universal algorithm for solving a very general class of optimization problems. We will now study how such a simple GA can be applied to the optimization of fuzzy sets. All we need is an appropriate coding, genetic operators (in case the standard variants are not sufficient), and a fitness measure.
5.1.1 Coding Fuzzy Subsets of an Interval
Since this is by far the most important case in applications of fuzzy systems, let us restrict ourselves to fuzzy subsets of a given real interval [a, b]. Of course, we will never be able to find a coding which accommodates every possible fuzzy set. It is usual in applications to fix a certain subclass which can be represented by means of a finite set of parameters. Descriptions of such fuzzy sets can then be encoded by coding these parameters.

The first class we mention here are piecewise linear membership functions with a fixed set of grid points (a = x0, x1, . . . , xn−1, xn = b), an equally spaced grid in the simplest case. Popular fuzzy control software tools like fuzzyTECH or TILShell use this technique for their internal representations of fuzzy sets. It is easy to see that the shape of the membership function is uniquely determined by the membership degrees at the grid points (see Figure 5.1 for an example). Therefore, we can simply encode such a fuzzy set by putting codings of all these membership values into one large string:

c_{n,[0,1]}(µ(x0)) c_{n,[0,1]}(µ(x1)) · · · c_{n,[0,1]}(µ(xn))
A reasonable resolution for encoding the membership degrees is n = 8. Such an 8-bit coding is used in several software systems, too. For most problems, however, simpler representations of fuzzy sets are sufficient. Many real-world applications use triangular and trapezoidal membership functions (cf. Figure 5.2). Not really surprisingly, a triangular fuzzy set can be encoded as
c_{n,[a,b]}(r) c_{n,[0,δ]}(u) c_{n,[0,δ]}(v),

where δ is an upper boundary for the size of the offsets, for example δ = (b − a)/2. The same can be done analogously for trapezoidal fuzzy sets:

c_{n,[a,b]}(r) c_{n,[0,δ]}(q) c_{n,[0,δ]}(u) c_{n,[0,δ]}(v).
In specific control applications, where the smoothness of the control surface plays an important role, fuzzy sets of higher differentiability must be used. The most prominent representative is the bell-shaped fuzzy set whose membership function is given by a Gaussian bell function:

µ(x) = exp(−(x − r)² / (2u²))

The "bell-shaped analogue" to trapezoidal fuzzy sets are so-called radial basis functions:

µ(x) = exp(−(|x − r| − q)² / (2u²))   if |x − r| > q
µ(x) = 1                              if |x − r| ≤ q

Figure 5.3 shows a typical bell-shaped membership function. Again the coding method is straightforward, i.e.

c_{n,[a,b]}(r) c_{n,[ε,δ]}(u),

where ε is a lower limit for the spread u. Analogously for radial basis functions:

c_{n,[a,b]}(r) c_{n,[0,δ]}(q) c_{n,[ε,δ]}(u)
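The four membership function classes can be written down directly. A sketch with hypothetical names, using the parameters of Figures 5.2 and 5.3 and assuming all offsets are positive:

```python
import math

def triangular(x, r, u, v):
    """Peak at r, rising from r - u, falling to r + v."""
    if r - u < x <= r:
        return (x - (r - u)) / u
    if r < x < r + v:
        return ((r + v) - x) / v
    return 0.0

def trapezoidal(x, r, q, u, v):
    """Plateau [r, r + q], rising from r - u, falling to r + q + v."""
    if r <= x <= r + q:
        return 1.0
    if r - u < x < r:
        return (x - (r - u)) / u
    if r + q < x < r + q + v:
        return ((r + q + v) - x) / v
    return 0.0

def bell(x, r, u):
    """Gaussian bell with center r and spread u."""
    return math.exp(-((x - r) ** 2) / (2 * u ** 2))

def radial_basis(x, r, q, u):
    """Plateau of half-width q around r with Gaussian flanks."""
    d = abs(x - r)
    return 1.0 if d <= q else math.exp(-((d - q) ** 2) / (2 * u ** 2))
```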
The ﬁnal step is simple and obvious: In order to deﬁne a coding of the whole conﬁguration, i.e. of all fuzzy sets involved, it is suﬃcient to put codings of all relevant fuzzy sets into one large string.
5.1.2 Coding Whole Fuzzy Partitions
There is often a priori knowledge about the approximate configuration, for instance, an ordering of the fuzzy sets. A general method which encodes all fuzzy sets belonging to one linguistic variable independently, as above, yields an unnecessarily large search space. A typical situation, not only in control applications, is that we have a certain number of fuzzy sets
Figure 5.1: Piecewise linear membership function with ﬁxed grid points.
Figure 5.2: Simple fuzzy sets with piecewise linear membership functions (triangular left, trapezoidal right).
Figure 5.3: Simple fuzzy sets with smooth membership functions (bellshaped left, radial basis right).
with labels like "small", "medium", and "large", or "negative big", "negative medium", "negative small", "approximately zero", "positive small", "positive medium", and "positive big". In such a case, we have a natural ordering of the fuzzy sets. By including appropriate constraints, the ordering of the fuzzy sets can be preserved while reducing the number of degrees of freedom.

We will now study a simple example: an increasing sequence of trapezoidal fuzzy sets. Such a "fuzzy partition" is uniquely determined by an increasing sequence of 2N points, where N is the number of linguistic values we consider. The mathematical formulation is the following:

µ1(x) = 1                             if x ∈ [x0, x1]
µ1(x) = (x2 − x) / (x2 − x1)          if x ∈ (x1, x2)
µ1(x) = 0                             otherwise

For 2 ≤ i ≤ N − 1:

µi(x) = (x − x_{2i−3}) / (x_{2i−2} − x_{2i−3})   if x ∈ (x_{2i−3}, x_{2i−2})
µi(x) = 1                                        if x ∈ [x_{2i−2}, x_{2i−1}]
µi(x) = (x_{2i} − x) / (x_{2i} − x_{2i−1})       if x ∈ (x_{2i−1}, x_{2i})
µi(x) = 0                                        otherwise

µN(x) = (x − x_{2N−3}) / (x_{2N−2} − x_{2N−3})   if x ∈ (x_{2N−3}, x_{2N−2})
µN(x) = 1                                        if x ≥ x_{2N−2}
µN(x) = 0                                        otherwise

Figure 5.4 shows a typical example with N = 4. It is not wise to encode the values xi as they are, since this requires constraints ensuring that the xi are nondecreasing. A good alternative is to encode the offsets:

c_{n,[0,δ]}(x1) c_{n,[0,δ]}(x2 − x1) · · · c_{n,[0,δ]}(x_{2N−2} − x_{2N−3})
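The offset coding and its inverse are one-liners. A sketch with hypothetical names, assuming the interval starts at a = 0 so that x1 itself is the first offset:

```python
def encode_offsets(xs):
    """Grid points x1 <= x2 <= ... are replaced by their nonnegative
    differences; monotonicity then needs no extra constraint."""
    return [xs[0]] + [xs[i] - xs[i - 1] for i in range(1, len(xs))]

def decode_offsets(offsets):
    """Cumulative sums restore the grid points."""
    xs, acc = [], 0.0
    for d in offsets:
        acc += d
        xs.append(acc)
    return xs
```

Since each offset is drawn from [0, δ] independently, any binary crossover of two encoded partitions again decodes to a nondecreasing sequence.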
5.1.3 Standard Fitness Functions
Although it is impossible to formulate a general recipe which fits all kinds of applications, there is one important standard situation: the case where a set of representative input-output examples is given. Assume that F(v, x) is the function which computes the output for a given input x with respect to the parameter vector v. Example data is given as a list of pairs (xi, yi) with 1 ≤ i ≤ K (K is the number of data samples). Obviously, the goal is to find a parameter configuration v such that the corresponding outputs F(v, xi) match the sample outputs yi as well as possible. This can be achieved by
Figure 5.4: A fuzzy partition with N = 4 trapezoidal parts.

minimizing the error function
f(v) = Σ_{i=1}^{K} d(F(v, x_i), y_i),
where d(., .) is some distance measure defined on the output space. If the output consists of real numbers, one prominent example is the well-known sum of quadratic errors:
f(v) = Σ_{i=1}^{K} (F(v, x_i) − y_i)²
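As a sketch (hypothetical names), the quadratic-error fitness reads:

```python
def quadratic_error(F, v, samples):
    """Sum of squared deviations between model outputs F(v, x) and the
    sample outputs y; this is the quantity to be minimized."""
    return sum((F(v, x) - y) ** 2 for x, y in samples)
```

With a perfect parameter configuration the error is zero; any mismatch contributes its squared deviation.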
5.1.4 Genetic Operators
Since we have only dealt with binary representations of fuzzy sets and partitions, all the operators from Chapter 2 are also applicable here. We should be aware, however, that the offset encoding of fuzzy partitions is highly epistatic. More specifically, if the part of the string encoding x1 is changed, the whole partition is shifted. If this results in bad convergence, the crossover operator should be modified. A suggestion can be found, for instance, in [3]. Figure 5.5 shows an example of what happens if two fuzzy partitions are crossed with normal one-point crossover. Figure 5.6 shows the same for bitwise mutation.
Figure 5.5: Example for one-point crossover of fuzzy partitions.
Figure 5.6: Mutating a fuzzy partition.
5.2 A Practical Example
Pixel classification is an important preprocessing task in many image processing applications. In this project, in which the FLLL developed an inspection system for a silk-screen printing process, it was necessary to extract regions from the print image which had to be checked by applying different criteria:

Homogeneous area: uniformly colored area;

Edge area: pixels within or close to visually significant edges;

Halftone: area which looks rather homogeneous from a certain distance, although it is actually obtained by printing small raster dots of two or more colors;

Picture: rastered area with high, chaotic deviations, in particular small high-contrasted details.
Figure 5.7: Magnifications of typical representatives of the four types of pixels (panels: Homogeneous, Edge, Halftone, Picture).

The clockwise enumeration l(k) of the eight neighbors of a pixel (i, j) is the following:

k = 1: (i, j−1)
k = 2: (i−1, j−1)
k = 3: (i−1, j)
k = 4: (i−1, j+1)
k = 5: (i, j+1)
k = 6: (i+1, j+1)
k = 7: (i+1, j)
k = 8: (i+1, j−1)

Figure 5.8: Clockwise enumeration of neighbor pixels.

The magnifications in Figure 5.7 show what these areas typically look like at the pixel level. Of course, transitions between two or more of these areas are possible; hence a fuzzy model is recommendable. If we plot the gray values of the eight neighbor pixels according to a clockwise enumeration (cf. Figure 5.8), we typically get curves like those shown in Figure 5.9. Seemingly, the size of the deviations, e.g. measured by the variance, can be used to distinguish between homogeneous areas, halftones, and the other two types. On the other hand, a method which judges the width and connectedness of the peaks should be used in order to separate edge areas from pictures. A simple but effective method for this purpose is the so-called discrepancy norm, for which there are already other applications in pattern recognition (cf. [22]):
‖x‖_D = max_{1≤α≤β≤n} |Σ_{i=α}^{β} x_i|
Figure 5.9: Typical gray value curves corresponding to the four types. A more detailed analysis of the discrepancy norm, especially how it can be computed in linear time, can be found in [2].
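The linear-time evaluation mentioned above rests on the observation that the maximal absolute partial sum equals the difference between the largest and the smallest prefix sum (with the empty prefix included). A sketch:

```python
def discrepancy_norm(x):
    """||x||_D = max over all index windows [alpha, beta] of the absolute
    window sum, computed in one pass over the prefix sums."""
    prefix = 0.0
    lo = hi = 0.0                 # extremes of the prefix sums, 0 included
    for value in x:
        prefix += value
        lo = min(lo, prefix)
        hi = max(hi, prefix)
    return hi - lo
```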
5.2.1 The Fuzzy System
For each pixel (i, j), we consider its nearest eight neighbors enumerated as described above, which yields a vector of eight gray values. As already mentioned, the variance of the gray value vector can be taken as a measure of the size of the deviations in the neighborhood of the pixel. Let us denote this value by v(i, j). On the other hand, the discrepancy norm of the vector, where we subtract the mean value of all entries from each entry, can be used as a criterion for whether the pixel is within or close to a visually significant edge (let us call this value e(i, j) in the following).

The fuzzy decision is then carried out for each pixel (i, j) independently: first of all, the characteristic values v(i, j) and e(i, j) are computed. These values are taken as the input of a small fuzzy system with two inputs and one output. Let us denote the linguistic variables on the input side by v and e. Since the position of the pixel is of no relevance for the decision in this concrete application, indices can be omitted here.

The input space of the variable v is represented by three fuzzy sets which are labeled "low", "med", and "high". Analogously, the input space of the variable e is represented by two fuzzy sets, which are labeled "low" and "high". Experiments have shown that [0, 600] and [0, 200] are appropriate domains for v and e, respectively. For the decomposition of the input domains, simple fuzzy partitions consisting of trapezoidal fuzzy subsets were chosen. Figure 5.10 shows what these partitions basically look like.

The output space is a set of linguistic labels, namely "Ho", "Ed", "Ha", and "Pi", which are, of course, just abbreviations of the names of the four types. Let us denote the output variable itself by t. Finally, the output of
Figure 5.10: The linguistic variables v and e.

the system for each pixel (i, j) is a fuzzy subset of {"Ho", "Ed", "Ha", "Pi"}. This output set is computed by processing the values v(i, j) and e(i, j) through a rule base with five rules, which cover all possible combinations:
IF v is low                 THEN t = Ho
IF v is med  AND e is high  THEN t = Ed
IF v is high AND e is high  THEN t = Ed
IF v is med  AND e is low   THEN t = Ha
IF v is high AND e is low   THEN t = Pi
In this application, ordinary Mamdani min/max inference is used. Finally, the degree to which "Ho", "Ed", "Ha", or "Pi" belongs to the output set can be regarded as the degree to which the particular pixel belongs to the area type Homogeneous, Edge, Halftone, or Picture, respectively.
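For this small rule base, Mamdani min/max inference amounts to a few lines. A sketch with hypothetical names; v_mem and e_mem are assumed to map the labels of v and e to the membership degrees of the current pixel:

```python
RULES = [
    (("low", None), "Ho"),       # IF v is low THEN t = Ho
    (("med", "high"), "Ed"),
    (("high", "high"), "Ed"),
    (("med", "low"), "Ha"),
    (("high", "low"), "Pi"),
]

def classify(v_mem, e_mem):
    """Degree of fulfillment of each rule = min of its antecedents;
    rules with the same consequent are aggregated with max."""
    out = {"Ho": 0.0, "Ed": 0.0, "Ha": 0.0, "Pi": 0.0}
    for (lv, le), label in RULES:
        degree = v_mem[lv] if le is None else min(v_mem[lv], e_mem[le])
        out[label] = max(out[label], degree)
    return out
```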
5.2.2 The Optimization of the Classification System
The behavior of the fuzzy system depends on six parameters, v1, . . . , v4, e1, and e2, which determine the shapes of the two fuzzy partitions. In the first step, these parameters were tuned manually. Of course, we have also taken into consideration the use of (semi-)automatic methods for finding the optimal parameters.

Our optimization procedure consists of a painting program which offers tools such as a pencil, a rubber, a filling algorithm, and many more. This painting tool can be used to make a reference classification of a given representative image by hand. Then an optimization algorithm can be used to find that configuration of parameters which yields the maximal degree of matching between the desired result and the output actually obtained by the classification system.

Assume that we have a set of N sample pixels for which the input values (ṽ_k, ẽ_k), k ∈ {1, . . . , N}, are computed and that we already have a reference classification of these pixels

t̃(k) = (t̃_Ho(k), t̃_Ed(k), t̃_Ha(k), t̃_Pi(k)),

where k ∈ {1, . . . , N}. Since, as soon as the values ṽ and ẽ are computed, the geometry of the image plays no role anymore, we can switch to one-dimensional indices here. One possible way to define the performance (fitness) of the fuzzy system would be

(1/N) Σ_{k=1}^{N} d(t̃(k), t(k)),     (5.1)
where t(k) = (t_Ho(k), t_Ed(k), t_Ha(k), t_Pi(k)) are the classifications actually obtained by the fuzzy system for the input pairs (ṽ_k, ẽ_k) with respect to the parameters v1, v2, v3, v4, e1, and e2; d(.,.) is an arbitrary (pseudo-)metric on [0, 1]^4. The problem with this brute force approach is that the output of the fuzzy system has to be evaluated for each pair (ṽ_k, ẽ_k), even if many of these values are similar or even equal. In order to keep the amount of computation low, we "simplified" the procedure by a "clustering process" as follows: We choose a partition (P_1, ..., P_K) of the input space, where (n_1, ..., n_K) are the numbers of sample points {p^i_1, ..., p^i_{n_i}} each part contains. Then the desired classification of a certain part (cluster) can be defined as

\tilde{t}_X(P_i) = \frac{1}{n_i} \sum_{j=1}^{n_i} \tilde{t}_X(p^i_j),

where X ∈ {Ho, Ed, Ha, Pi}. If φ is a function which maps each cluster to a representative value (e.g., its center of gravity), we can define the fitness (objective) function as

\frac{100}{N} \sum_{i=1}^{K} n_i \cdot \Big( 1 - \frac{1}{2} \sum_{X \in \{Ho,Ed,Ha,Pi\}} \big( \tilde{t}_X(P_i) - t_X(\varphi(P_i)) \big)^2 \Big).    (5.2)
If the number of parts is chosen moderately (e.g. a rectangular 64 × 32 net which yields K = 2048) the evaluation of the ﬁtness function takes considerably less time than a direct application of formula (5.1). Note that in (5.2) the ﬁtness is already transformed such that it can be regarded as a degree of matching between the desired and the actually obtained classiﬁcation measured in percent. This value has to be maximized.
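A rough sketch of this clustering shortcut follows, assuming a rectangular grid over the (v, e) input space with cell centers as representative values; the exact form of the mismatch term is partly an assumption, and `classify` stands for any function returning the four membership degrees.

```python
def clustered_fitness(samples, ref, classify, grid=(64, 32),
                      v_max=600.0, e_max=200.0):
    """Group sample pixels into a rectangular grid over the input space,
    evaluate the fuzzy system only once per non-empty cell (at its center),
    and compare with the averaged reference classification of the cell.
    Returns a degree of matching in percent (to be maximized)."""
    labels = ("Ho", "Ed", "Ha", "Pi")
    cells = {}
    for (v, e), t_ref in zip(samples, ref):
        i = min(int(v / v_max * grid[0]), grid[0] - 1)
        j = min(int(e / e_max * grid[1]), grid[1] - 1)
        cells.setdefault((i, j), []).append(t_ref)
    total, n_samples = 0.0, len(samples)
    for (i, j), members in cells.items():
        v_c = (i + 0.5) / grid[0] * v_max   # representative input:
        e_c = (j + 0.5) / grid[1] * e_max   # the center of the cell
        t = classify(v_c, e_c)
        t_bar = {x: sum(m[x] for m in members) / len(members) for x in labels}
        mismatch = 0.5 * sum((t_bar[x] - t[x]) ** 2 for x in labels)
        total += len(members) * (1.0 - mismatch)
    return 100.0 * total / n_samples
```

The fitness function is evaluated once per non-empty cell instead of once per pixel, which is exactly where the savings over the brute force formulation come from.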
Figure 5.11: Cross sections of a function of type (5.2).

In fact, fitness functions of this type are, in almost all cases, continuous but not differentiable and have a lot of local maxima. Figure 5.11 shows cross sections of such functions. Therefore, it is more reasonable to use probabilistic optimization algorithms than to apply continuous optimization methods, which make excessive use of derivatives. This, first of all, requires a (binary) coding of the parameters. We decided to use a coding which maps the parameters v1, v2, v3, v4, e1, and e2 to a string of six 8-bit integers s1, ..., s6 which range from 0 to 255. The following table shows how the encoding and decoding are done:
s1 = v1          v1 = s1
s2 = v2 − v1     v2 = s1 + s2
s3 = v3 − v2     v3 = s1 + s2 + s3
s4 = v4 − v3     v4 = s1 + s2 + s3 + s4
s5 = e1          e1 = s5
s6 = e2 − e1     e2 = s5 + s6
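In code, the cumulative coding and its inverse look as follows. A useful property of this design: decoding any six bytes automatically yields monotone parameters v1 ≤ v2 ≤ v3 ≤ v4 and e1 ≤ e2, so every 48-bit string is a valid configuration.

```python
def decode(s):
    """Map six 8-bit integers (s1, ..., s6) to (v1, v2, v3, v4, e1, e2)
    via cumulative sums, guaranteeing v1 <= v2 <= v3 <= v4 and e1 <= e2."""
    s1, s2, s3, s4, s5, s6 = s
    return (s1, s1 + s2, s1 + s2 + s3, s1 + s2 + s3 + s4, s5, s5 + s6)

def encode(v1, v2, v3, v4, e1, e2):
    """Inverse mapping: store the parameters as successive differences."""
    return (v1, v2 - v1, v3 - v2, v4 - v3, e1, e2 - e1)
```

Because of this, crossover and bitwise mutation operating on the six bytes can never produce an inconsistent (non-monotone) parameter set.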
We first tried a simple GA with standard roulette wheel selection, one-point crossover with uniform selection of the crossing point, and bitwise mutation. The length of the strings was, as shown above, 48. In order to compare the performance of the GA with other well-known probabilistic optimization methods, we additionally considered the following methods:

Hill climbing: always moves to the best-fitted neighbor of the current string until a local maximum is reached; the initial string is generated randomly.

Simulated annealing: a powerful, often-used probabilistic method which is based on the imitation of the solidification of a crystal under slowly decreasing temperature.
                              fmax      fmin      f̄        σf      It
Hill Climbing                 94.3659   89.6629   93.5536   1.106    862
Simulated Annealing           94.3648   89.6625   93.5639   1.390   1510
Improved Simulated Annealing  94.3773   93.7056   94.2697   0.229  21968
GA                            94.3760   93.5927   94.2485   0.218   9910
Hybrid GA (elite)             94.3760   93.6299   94.2775   0.207   7460
Hybrid GA (random)            94.3776   94.3362   94.3693   0.009  18631
Figure 5.12: A comparison of results obtained by several different optimization methods.

Each one of these methods requires only a few binary operations in each step. Most of the time is consumed by the evaluation of the fitness function, so it is natural to take the number of evaluations as a measure for the speed of the algorithms.

Results

All these algorithms are probabilistic methods; therefore, their results are not well-determined and can differ randomly within certain boundaries. In order to get more information about their average behavior, we tried out each one of them 20 times on one certain problem. For the given problem, we found that the maximal degree of matching between the reference classification and the classification actually obtained by the fuzzy system was 94.3776%. In the table in Figure 5.12, fmax is the fitness of the best and fmin is the fitness of the worst solution; f̄ denotes the average fitness of the 20 solutions, σf denotes the standard deviation of the fitness values of the 20 solutions, and It stands for the average number of evaluations of the fitness function which was necessary to reach the solution.

The hill climbing method with random selection of the initial string converged rather quickly. Unfortunately, it was always trapped in a local maximum and never reached the global solution (at least in these 20 trials).
Figure 5.13: A graphical representation of the results.
The simulated annealing algorithm showed similar behavior at the very beginning. After tuning the parameters involved, the performance improved remarkably.

The raw genetic algorithm was implemented with a population size of 20; the crossover probability was set to 0.85, and the mutation probability was 0.005 for each byte. It behaved pretty well from the beginning, but it seemed inferior to the improved simulated annealing.

Next, we tried a hybrid GA, where we kept the genetic operations and parameters of the raw GA, but every 50th generation the best-fitted individual was taken as the initial string for a hill climbing method. Although the performance increased slightly, the hybrid method still seemed to be worse than the improved simulated annealing algorithm. The reason that the effects of this modification were not so dramatic might be that the probability is rather high that the best individual is already a local maximum.

So we modified the procedure again. This time a randomly chosen individual of every 25th generation was used as the initial solution for the hill climbing method. The result exceeded our expectations by far. The algorithm was, in all cases, nearer to the global solution than the improved simulated annealing (compare with the table in Figure 5.12), but, surprisingly, required fewer invocations of the fitness function.

The graph in Figure 5.13 shows the results graphically. Each line in this graph corresponds to one algorithm. The curve shows, for a given fitness value x, how many of the 20 different solutions had a fitness higher than or equal to x. It can easily be seen from this graph that the hybrid GA with random selection led to the best results. Note that the x-axis is not a linear scale in this figure; it was transformed in order to make small differences visible.
5.2.3 Concluding Remarks
In this example, we have investigated the suitability of genetic algorithms for finding the optimal parameters of a fuzzy system, especially when the analytical properties of the objective function are bad. Moreover, hybridization has been shown to offer enormous potential for improving genetic algorithms.
5.3 Finding Rule Bases with GAs
Now let us briefly turn to the second learning problem from page 58. If we find a method for encoding a rule base into a string of fixed length, all
the genetic methods we have dealt with so far are applicable with only minor modifications. Of course, we have to assume in this case that the numbers of linguistic values of all linguistic variables are finite.

The simplest case is that of coding a complete rule base which covers all possible cases. Such a rule base is represented as a list for one input variable, as a matrix for two variables, and as a tensor in the case of more than two input variables. For example, consider a rule base of the following form (the generalization to more than two input variables is straightforward):

IF x1 is A_i AND x2 is B_j THEN y is C̃_{i,j}

A_i and B_j are verbal values of the variables x1 and x2, respectively. All the values A_i are pairwise different, and analogously for the values B_j; i ranges from 1 to N1, the total number of linguistic values of variable x1; j ranges from 1 to N2, the total number of linguistic values of variable x2. The values C̃_{i,j} are arbitrary elements of the set of pairwise different linguistic values {C_1, ..., C_{N_y}} associated with the output variable y. Obviously, such a rule base is uniquely represented by a matrix, a so-called decision table:

          B_1        ···   B_{N2}
A_1       C̃_{1,1}    ···   C̃_{1,N2}
...       ...        ...   ...
A_{N1}    C̃_{N1,1}   ···   C̃_{N1,N2}
Of course, the representation is still unique if we replace the values C̃_{i,j} by their unique indices within the set {C_1, ..., C_{N_y}}, and so we have found a proper coding scheme for table-based rule bases.

5.1 Example. Consider a fuzzy system with two inputs (x1 and x2) and one output y. The domains of all three variables are divided into four fuzzy sets labeled "small", "medium", "large", and "very large" (abbreviated "S", "M", "L", and "V" for short). We will now study how the following decision table can be encoded into a string:

        S  M  L  V
    S   S  S  S  M
    M   S  S  M  L
    L   S  M  L  V
    V   M  L  V  V
For example, the third entry “M” in the second row reads as follows: IF x1 is medium AND x2 is large THEN y is medium
If we assign indices ranging from 0 to 3 to the four linguistic values associated with the output variable y, we can write the decision table as an integer string of length 16:

(0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 3)

Replacing the integer values with two-bit binary strings, we obtain a 32-bit binary string which uniquely describes the above decision table:

(00000001000001100001101101101111)

For the method above, genetic algorithms of type 2.5 are perfectly suitable. Of course, the fitness functions which we introduced in 5.1.3 can also be used without any modifications.

It is easy to see that the approach above works consequent-oriented, meaning that the premises are fixed; only the consequent values must be acquired. Such an idea can only be applied to the optimization of complete rule bases, which are, in more complex applications, not so easy to handle. Moreover, complete rule bases are often overkill and require a lot of storage resources. In many applications, especially in control, it is enough to have an incomplete rule base consisting of a certain number of rules which cover the input space sufficiently well. The acquisition of incomplete rule bases is a task which is not so easy to solve with representations of fixed length. We will come back to this point a little later.
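The two coding steps above can be sketched in a few lines; reading the table row by row (as the integer string in the text does), the decision table of Example 5.1 reproduces exactly the 32-bit string given above.

```python
LABELS = "SMLV"   # output labels, indexed 0..3

def table_to_bits(table):
    """Flatten a complete decision table (rows = values of x1, columns =
    values of x2) row by row into a binary string, two bits per entry."""
    return "".join(format(LABELS.index(entry), "02b")
                   for row in table for entry in row)

# the decision table from Example 5.1, one string per row
table = ["SSSM",
         "SSML",
         "SMLV",
         "MLVV"]
```

Calling `table_to_bits(table)` yields "00000001000001100001101101101111", the string derived in the text.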
Chapter 6

Genetic Programming
How can computers learn to solve problems without being explicitly programmed? In other words, how can computers be made to do what is needed to be done, without being told explicitly how to do it?

John R. Koza

Mathematicians and computer scientists, in their everyday practice, do nothing else than search for programs which solve given problems properly. They usually try to design such programs based on their knowledge of the problem, its underlying principles, mathematical models, their intuition, etc. Koza's questions seem somewhat provocative and utopian. His answers, however, are remarkable and worth discussing here in more detail.

The basic idea is simple but appealing: to apply genetic algorithms to the problem of automatic program induction. All we need in order to do so are appropriate modifications of all the genetic operations we have discussed so far. This includes random initialization, crossover, and mutation. For selection and sampling, we do not necessarily need anything new, because these methods are independent of the underlying data representation.

Of course, this sounds great. The question arises, however, whether this kind of Genetic Programming (GP) can work at all. Koza, in his remarkable monograph [19], starts with a rather vague hypothesis.

6.1 The Genetic Programming Paradigm. Provided that we are given a solvable problem, a definition of an appropriate programming language, and a sufficiently large set of representative test examples (correct input-output pairs), a genetic algorithm is able to find a program which (approximately) solves the problem.
This seems to be a matter of belief. Nobody has been able to prove this hypothesis so far, and it is doubtful whether this will ever be possible. Instead of giving a proof, Koza has elaborated a large set of well-chosen examples which underline his hypothesis empirically. The problems he has solved successfully with GP include the following:

• Process control (bang-bang control of an inverted pendulum)
• Logistics (simple robot control, stacking problems)
• Automatic programming (pseudo-random number generators, prime number program, ANN design)
• Game strategies (Poker, Tic Tac Toe)
• Inverse kinematics
• Classification
• Symbolic computation:
  – Sequence induction (Fibonacci sequence, etc.)
  – Symbolic regression
  – Solving equations (power series-based solutions of functional, differential, and integral equations)
  – Symbolic differentiation and integration
  – Automatic discovery of trigonometric identities

This chapter is devoted to a brief introduction to genetic programming. We will restrict ourselves to the basic methodological issues and omit elaborating examples in detail. For a nice example, the reader is referred to [12].
6.1 Data Representation
Without any doubt, programs can be considered as strings. There are, however, two important limitations which make it impossible to use the representations and operations from our simple GA: 1. It is mostly inappropriate to assume a ﬁxed length of programs.
2. The probability of obtaining syntactically correct programs when applying our simple initialization, crossover, and mutation procedures is hopelessly low.

It is, therefore, indispensable to modify the data representation and the operations such that syntactical correctness is easier to guarantee. The common approach in genetic programming is to represent programs as trees. By doing so, initialization can be done recursively, crossover can be done by exchanging subtrees, and random replacement of subtrees can serve as the mutation operation.

Since their only construct is the nested list, programs in LISP-like languages already have a kind of tree-like structure. Figure 6.1 shows an example of how the function 3x + sin(x + 1) can be implemented in a LISP-like language and how such a LISP-like function can be split up into a tree. Obviously, the tree representation directly corresponds to the nested lists the program consists of; atomic expressions, like variables and constants, are leaves, while functions correspond to non-leaf nodes.

Figure 6.1: The tree representation of (+ (* 3 X) (SIN (+ X 1))).

There is one important disadvantage of the LISP approach: it is difficult to introduce type checking. In the case of a purely numeric function like in the above example, there is no problem at all. However, it can be desirable to process numeric data, strings, and logical expressions simultaneously. This is difficult to handle if we use a tree representation like that in Figure 6.1.

A very general approach, which overcomes this problem while allowing maximum flexibility, has been proposed by A. Geyer-Schulz. He suggested representing programs by their syntactical derivation trees with respect to a recursive definition of the underlying language in Backus-Naur Form (BNF) [10].
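As a minimal sketch, the program of Figure 6.1 can be held as nested tuples and evaluated recursively; the tuple encoding and the `evaluate` helper are our own illustration, not part of the text.

```python
import math

# the program of Figure 6.1 as a nested structure: (operator, child, ...)
TREE = ("+", ("*", 3, "X"), ("SIN", ("+", "X", 1)))

def evaluate(node, env):
    """Recursively evaluate an expression tree under a variable binding."""
    if isinstance(node, tuple):
        op, *children = node
        args = [evaluate(c, env) for c in children]
        if op == "+":
            return args[0] + args[1]
        if op == "*":
            return args[0] * args[1]
        if op == "SIN":
            return math.sin(args[0])
        raise ValueError("unknown operator: " + op)
    # leaves: either a variable name to look up or a numeric constant
    return env[node] if isinstance(node, str) else node
```

For instance, `evaluate(TREE, {"X": 2.0})` computes 3·2 + sin(2 + 1), mirroring how the nested list is interpreted in a LISP-like language.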
This works for any context-free language. It is far beyond the scope of this lecture to go into much detail about formal languages. We will explain the basics with the help of a simple example. Consider the following language, which is suitable for implementing binary logical expressions:

S      := <exp> ;
<exp>  := <var> | "(" <neg> <exp> ")" | "(" <exp> <bin> <exp> ")" ;
<var>  := "x" | "y" ;
<neg>  := "NOT" ;
<bin>  := "AND" | "OR" ;
The BNF description consists of so-called syntactical rules. Symbols in angle brackets are called nonterminal symbols, i.e. symbols which have to be expanded. Symbols between quotation marks are called terminal symbols, i.e. they cannot be expanded any further. The first rule, S := <exp> ;, defines the starting symbol. A BNF rule of the general shape

<nonterminal> := deriv_1 | deriv_2 | ... | deriv_n ;

defines how a nonterminal symbol may be expanded, where the different variants are separated by vertical bars.

In order to get a feeling for how to work with the BNF grammar description, we will now show step by step how the expression (NOT (x OR y)) can be derived from the above language. For simplicity, we omit quotation marks for the terminal symbols:

1. We have to begin with the start symbol: <exp>

2. We replace <exp> with the second possible derivation:
   <exp> −→ ( <neg> <exp> )

3. The symbol <neg> may only be expanded with the terminal symbol NOT:
   ( <neg> <exp> ) −→ (NOT <exp> )

4. Next, we replace <exp> with the third possible derivation:
   (NOT <exp> ) −→ (NOT ( <exp> <bin> <exp> ))

5. We take the second possible derivation for <bin>:
   (NOT ( <exp> <bin> <exp> )) −→ (NOT ( <exp> OR <exp> ))
6. The first occurrence of <exp> is expanded with the first derivation:
   (NOT ( <exp> OR <exp> )) −→ (NOT ( <var> OR <exp> ))

7. The second occurrence of <exp> is expanded with the first derivation, too:
   (NOT ( <var> OR <exp> )) −→ (NOT ( <var> OR <var> ))

8. Now we replace the first <var> with the corresponding first alternative:
   (NOT ( <var> OR <var> )) −→ (NOT (x OR <var> ))

9. Finally, the last nonterminal symbol is expanded with the second alternative:
   (NOT (x OR <var> )) −→ (NOT (x OR y))

Such a recursive derivation has an inherent tree structure. For the above example, this derivation tree is visualized in Figure 6.2.

Figure 6.2: The derivation tree of (NOT (x OR y)).
6.1.1 The Choice of the Programming Language
The syntax of modern programming languages can be speciﬁed in BNF. Hence, our data model would be applicable to all of them. The question
is whether this is useful. Koza's hypothesis includes that the programming language has to be chosen such that the given problem is solvable. This does not necessarily imply that we have to choose the language such that virtually any solvable problem can be solved. It is obvious that the size of the search space grows with the complexity of the language. We know that the size of the search space influences the performance of a genetic algorithm: the larger, the slower. It is, therefore, advisable to restrict the language to the necessary constructs and to avoid superfluous ones.

Assume, for example, that we want to do symbolic regression, but we are only interested in polynomials with integer coefficients. For such an application, it would be overkill to introduce rational constants or to include exponential functions in the language. A good choice could be the following:

S       := <func> ;
<func>  := <var> | <const> | "(" <func> <bin> <func> ")" ;
<var>   := "x" ;
<const> := <int> | <const> <int> ;
<int>   := "0" | · · · | "9" ;
<bin>   := "+" | "−" | "∗" ;
For representing rational functions with integer coefficients, it is sufficient to add the division symbol "/" to the possible derivations of the binary operator <bin>. Another example: the following language could be appropriate for discovering trigonometric identities:

S       := <func> ;
<func>  := <var> | <const> | <trig> "(" <func> ")" | "(" <func> <bin> <func> ")" ;
<var>   := "x" ;
<const> := "0" | "1" | "π" ;
<trig>  := "sin" | "cos" ;
<bin>   := "+" | "−" | "∗" ;
6.2 Manipulating Programs
We have a generic coding of programs: the derivation trees. It remains to define the three operators random initialization, crossover, and mutation for derivation trees.
6.2.1 Random Initialization
Until now, we have not paid any attention to the creation of the initial population. We assumed implicitly that the individuals of the first generation can be generated randomly with respect to a certain probability distribution (mostly uniform). Undoubtedly, this is an absolutely trivial task if we deal with binary strings of fixed length. The random generation of derivation trees, however, is a much more subtle task. There are basically two different variants of how to generate random programs with respect to a given BNF grammar:

1. Beginning from the starting symbol, it is possible to expand nonterminal symbols recursively, where we have to choose randomly if we have more than one alternative derivation. This approach is simple and fast, but has some disadvantages: firstly, it is almost impossible to realize a uniform distribution; secondly, one has to implement some constraints with respect to the depth of the derivation trees in order to avoid excessive growth of the programs. Depending on the complexity of the underlying grammar, this can be a tedious task.

2. Geyer-Schulz [11] has suggested preparing a list of all possible derivation trees up to a certain depth¹ and selecting from this list randomly, applying a uniform distribution. Obviously, in this approach, the problems in terms of depth and the resulting probability distribution are elegantly solved, but these advantages go along with considerably long computation times.
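Variant 1 can be sketched as follows for the binary-logic grammar from above. Forcing expansion to the first (terminating) alternative once a depth bound is exhausted is one simple way to realize the growth constraint mentioned above; the details of this cutoff are our assumption.

```python
import random

# nonterminals are the dictionary keys; every other symbol is terminal
GRAMMAR = {
    "exp": [["var"],
            ["(", "neg", "exp", ")"],
            ["(", "exp", "bin", "exp", ")"]],
    "var": [["x"], ["y"]],
    "neg": [["NOT"]],
    "bin": [["AND"], ["OR"]],
}

def generate(symbol="exp", max_depth=6, rng=random):
    """Expand nonterminals recursively, choosing among the alternative
    derivations at random; once max_depth is exhausted, always take the
    first alternative (which terminates for this grammar).  The result is
    a derivation tree represented as a nested list [nonterminal, ...]."""
    if symbol not in GRAMMAR:                    # terminal symbol
        return symbol
    alternatives = GRAMMAR[symbol]
    derivation = alternatives[0] if max_depth <= 0 else rng.choice(alternatives)
    return [symbol] + [generate(s, max_depth - 1, rng) for s in derivation]

def terminals(tree):
    """Read the expression (the terminal symbols) off a derivation tree."""
    if isinstance(tree, str):
        return [tree]
    return [t for child in tree[1:] for t in terminals(child)]
```

By construction, `" ".join(terminals(generate()))` is always a syntactically correct binary logical expression.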
6.2.2 Crossing Programs
It is trivial to see that primitive string-based crossover of programs almost never yields syntactically correct programs. Instead, we should use the perfect syntax information a derivation tree provides. Already in the LISP times of genetic programming, some time before the BNF-based representation was known, crossover was usually implemented as the exchange of randomly selected subtrees. In the case that the subtrees (subexpressions) may have different types of return values (e.g. logical and numerical), it is not guaranteed that crossover preserves syntactical correctness.
¹The depth is defined as the number of all nonterminal symbols in the derivation tree. There is no one-to-one correspondence to the height of the tree.
The derivation tree-based representation overcomes this problem in a very elegant way: if we only exchange subtrees which start from the same nonterminal symbol, crossover can never violate syntactical correctness. In this sense, the derivation tree model provides implicit type checking. In order to demonstrate in more detail how this crossover operation works, let us reconsider the example of binary logical expressions (the grammar defined on page 80). As parents, we take the following expressions:

(NOT (x OR y))
((NOT x) OR (x AND y))

Figure 6.3 shows graphically how the two children

(NOT (x OR (x AND y)))
((NOT x) OR y)

are obtained.
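A sketch of this type-preserving crossover on derivation trees, represented as nested lists [nonterminal, child, ...]; the path-based helper functions are our own scaffolding, not from the text.

```python
import copy
import random

# derivation trees of the two parent expressions from the text
PARENT1 = ["exp", "(", ["neg", "NOT"],                 # (NOT (x OR y))
           ["exp", "(", ["exp", ["var", "x"]], ["bin", "OR"],
            ["exp", ["var", "y"]], ")"], ")"]
PARENT2 = ["exp", "(",                                 # ((NOT x) OR (x AND y))
           ["exp", "(", ["neg", "NOT"], ["exp", ["var", "x"]], ")"],
           ["bin", "OR"],
           ["exp", "(", ["exp", ["var", "x"]], ["bin", "AND"],
            ["exp", ["var", "y"]], ")"], ")"]

def nodes(tree, path=()):
    """Yield (path, nonterminal) for every nonterminal node; a path is the
    sequence of child indices leading from the root to the node."""
    if isinstance(tree, list):
        yield path, tree[0]
        for i, child in enumerate(tree[1:], start=1):
            yield from nodes(child, path + (i,))

def subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def replace(tree, path, new):
    subtree(tree, path[:-1])[path[-1]] = new

def crossover(parent1, parent2, rng=random):
    """Exchange two randomly chosen subtrees rooted at the same nonterminal
    symbol; by construction this never violates syntactical correctness."""
    c1, c2 = copy.deepcopy(parent1), copy.deepcopy(parent2)
    pairs = [(p, q)
             for p, s1 in nodes(c1) if p          # exclude the roots
             for q, s2 in nodes(c2) if q and s1 == s2]
    p, q = rng.choice(pairs)
    s1, s2 = subtree(c1, p), subtree(c2, q)
    replace(c1, p, s2)
    replace(c2, q, s1)
    return c1, c2
```

Restricting the candidate pairs to matching nonterminals is exactly the implicit type checking described above.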
6.2.3 Mutating Programs
We have always considered mutation to be the random deformation of a small part of a chromosome. It is, therefore, not surprising that the most common mutation in genetic programming is the random replacement of a randomly selected subtree. This can be accomplished with the method presented in 6.2.1. The only modification is that we do not necessarily start from the start symbol, but from the nonterminal symbol at the root of the subtree we consider. Figure 6.4 shows an example where, in the logical expression (NOT (x OR y)), the variable y is replaced by (NOT y).
6.2.4 The Fitness Function
There is no common recipe for specifying an appropriate fitness function; this strongly depends on the given problem. It is, however, worth emphasizing that it is necessary to provide enough information to guide the GA to the solution. More specifically, it is not sufficient to define a fitness function which assigns 0 to a program which does not solve the problem and 1 to a program which solves it; such a fitness function would correspond to a needle-in-a-haystack problem. In this sense, a proper fitness measure should be a gradual concept for judging the correctness of programs.
Figure 6.3: An example of crossing two binary logical expressions.
Figure 6.4: An example of mutating a derivation tree.

In many applications, the fitness function is based on a comparison of desired and actually obtained outputs (compare with 5.1.3, p. 62). Koza, for instance, uses the simple sum of quadratic errors for symbolic regression and the discovery of trigonometric identities:

f(F) = \sum_{i=1}^{N} (y_i − F(x_i))^2
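This error measure translates directly into code; here `program` stands for the callable corresponding to the function F of the individual under evaluation.

```python
def error_fitness(program, samples):
    """Koza-style raw fitness for symbolic regression: the sum of squared
    deviations over the reference pairs (x_i, y_i); to be minimized."""
    return sum((y - program(x)) ** 2 for x, y in samples)
```

For the target function x², for example, a program implementing exactly x² has fitness 0, and every deviation over the sample points increases the fitness (i.e. worsens it) gradually, which is precisely the graded guidance a GA needs.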
In this definition, F is the mathematical function which corresponds to the program under evaluation. The list (x_i, y_i)_{1≤i≤N} consists of reference pairs: a desired output y_i is assigned to each input x_i. Clearly, the samples have to be chosen such that the considered input space is covered sufficiently well.

Numeric error-based fitness functions usually imply minimization problems. Some other applications may imply maximization tasks. There are basically two well-known transformations which allow us to standardize fitness functions such that either minimization or maximization tasks are always obtained.

6.2 Definition. Consider an arbitrary "raw" fitness function f. Assuming that the number of individuals in the population is not fixed (m_t at time t), the standardized fitness is computed as

f_S(b_{i,t}) = \max_{j=1,\dots,m_t} f(b_{j,t}) − f(b_{i,t})

in case that f is to be maximized and as

f_S(b_{i,t}) = f(b_{i,t}) − \min_{j=1,\dots,m_t} f(b_{j,t})

if f has to be minimized. One possible variant is to consider the best individual of the last k generations instead of only the current generation. Obviously, standardized fitness transforms any optimization problem into a minimization task.

Roulette wheel selection relies on the fact that the objective is maximization of the fitness function. Koza has suggested a simple transformation such that, in any case, a maximization problem is obtained.

6.3 Definition. With the assumptions of Definition 6.2, the adjusted fitness is computed as

f_A(b_{i,t}) = \max_{j=1,\dots,m_t} f_S(b_{j,t}) − f_S(b_{i,t}).

Another variant of adjusted fitness is defined as

f_A(b_{i,t}) = \frac{1}{1 + f_S(b_{i,t})}.
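A sketch of both transformations applied to one generation of raw fitness values, reading the definitions so that standardized fitness is always a minimization task with optimum 0 (the sign conventions are our reading of the garbled original):

```python
def standardized(raw, maximize):
    """Definition 6.2: shift (and, for maximization, flip) the raw fitness
    values of the current population so that the best individual gets 0
    and the task always becomes minimization."""
    if maximize:
        best = max(raw)
        return [best - f for f in raw]
    base = min(raw)
    return [f - base for f in raw]

def adjusted(std):
    """Definition 6.3: turn standardized fitness back into a maximization
    problem, as required e.g. by roulette wheel selection."""
    worst = max(std)
    return [worst - f for f in std]

def adjusted_reciprocal(std):
    """The reciprocal variant: values in (0, 1], with 1 for the best."""
    return [1.0 / (1.0 + f) for f in std]
```

For example, raw maximization fitnesses [3, 1, 2] become standardized [0, 2, 1] (the best individual at 0), and the reciprocal adjustment maps these to [1, 1/3, 1/2].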
6.3 Fuzzy Genetic Programming
It was already mentioned that the acquisition of fuzzy rule bases from example data is an important problem (points 2 and 3 of the classification on pp. 58ff.). We have seen in 5.3, however, that the possibilities for finding rule bases automatically are strongly limited. A revolutionary idea was introduced by A. Geyer-Schulz: to specify a rule language in BNF and to apply genetic programming. For obvious reasons, we refer to this synergistic combination as fuzzy genetic programming. It elegantly overcomes the limitations of the other approaches:

1. If a rule base is represented as a list of rules of arbitrary length, we are not restricted to complete decision tables.

2. We are not restricted to atomic expressions; it is easily possible to introduce additional connectives and linguistic modifiers, such as "very", "at least", "roughly", etc.

The following example shows how a fuzzy rule language can be specified in Backus-Naur form. Obviously, this fuzzy system has two inputs x1 and x2.
The output variable is y. The domain of x1 is divided into three fuzzy sets "neg", "approx. zero", and "pos". The domain of x2 is divided into three fuzzy sets which are labeled "small", "medium", and "large". For the output variable y, five atomic fuzzy sets called "nb", "nm", "z", "pm", and "pb" are specified.

S             := <rb> ;
<rb>          := <rule> | <rule> "," <rb> ;
<rule>        := "IF" <premise> "THEN" <conclusion> ;
<premise>     := <atomic> | "(" <neg> <premise> ")" | "(" <premise> <bin> <premise> ")" ;
<neg>         := "NOT" ;
<bin>         := "AND" | "OR" ;
<atomic>      := "x1" "is" <val1> | "x2" "is" <val2> ;
<conclusion>  := "y" "is" <val3> ;
<val1>        := <adjective1> | <adverb> <adjective1> ;
<val2>        := <adjective2> | <adverb> <adjective2> ;
<val3>        := <adjective3> | <adverb> <adjective3> ;
<adverb>      := "at least" | "at most" | "roughly" ;
<adjective1>  := "neg" | "approx. zero" | "pos" ;
<adjective2>  := "small" | "medium" | "large" ;
<adjective3>  := "nb" | "nm" | "z" | "pm" | "pb" ;
A very nice example of an application of genetic programming and fuzzy genetic programming to a stock management problem can be found in [12].
6.4 A Checklist for Applying Genetic Programming
We conclude this chapter with a checklist of things which are necessary for applying genetic programming to a given problem:

1. An appropriate fitness function which provides enough information to guide the GA to the solution (mostly based on examples).

2. A syntactical description of a programming language which contains as many elements as necessary for solving the problem.

3. An interpreter for the programming language.
Chapter 7

Classifier Systems
Ever since Socrates taught geometry to the slave boy in Plato's Meno, the nature of learning has been an active topic of investigation. For centuries, it was the province of philosophers, who analytically studied inductive and deductive inference. A hundred years ago, psychology began to use experimental methods to investigate learning in humans and other organisms. Still more recently, the computer has provided a research tool, engendering the field of machine learning.

J. H. Holland, K. J. Holyoak, R. E. Nisbett, and P. R. Thagard
7.1 Introduction
Almost all GA-based approaches to machine learning problems have in common, firstly, that they operate on populations of models/descriptions/rule bases and, secondly, that the individuals are judged globally, i.e. there is one fitness value for each model indicating how well it describes the actual interrelations in the data. The main advantage of such approaches is simplicity: there are only two things one has to find, a data representation which is suitable for a genetic algorithm and a fitness function. In particular, if the representation is rule-based, no complicated examination of which rules are responsible for success or failure has to be done.

The convergence of such methods, however, can be weak, because single obstructive parts can deteriorate the fitness of a whole description which could contain useful, well-performing rules. Moreover, genetic algorithms are
often perfect in ﬁnding suboptimal global solutions quickly; local reﬁnement, on the other hand, can take a long time. Another aspect is that it is sometimes diﬃcult to deﬁne a global quality measure which provides enough information to guide a genetic algorithm to a proper solution. Consider, for instance, the game of chess: A global quality measure could be the percentage of successes in a large number of games or, using more speciﬁc knowledge, the number of moves it took to be successful in the case of success and the number of moves it had been possible to postpone the winning of the opponent in the case of failure. It is easy to see that such information provides only a scarce foundation for learning chess, even if more detailed information, such as the number of captured pieces, is involved. On the contrary, it is easier to learn the principles of chess, when the direct eﬀect of the application of a certain rule can be observed immediately or at least after a few steps. The problem, not only in the case of chess, is that early actions can also contribute much to a ﬁnal success. In the following, we will deal with a paradigm which can provide solutions to some of the above problems—the socalled classiﬁer systems of the Michigan type. Roughly speaking, they try to ﬁnd rules for solving a task in an online process according to responses from the environment by employing a genetic algorithm. Figure 7.1 shows the basic architecture of such a system. The main components are: 1. A production system containing a rule base which processes incoming messages from the environment and sends output messages to the environment. 2. An apportionment of credit system which receives payoﬀ from the environment and determines which rules had been responsible for that feedback; this component assigns strength values to the single rules in the rule base. These values represent the performance and usefulness of the rules. 3. 
A genetic algorithm which recombines well-performing rules into new, hopefully better ones, where the strengths of the rules are used as objective function values.

Obviously, the learning task is divided into two subtasks—the judgment of already existing rules and the discovery of new ones. A few basic characteristics of such systems are worth mentioning:
Figure 7.1: Basic architecture of a classifier system of the Michigan type.

1. The basis of the search does not consist of examples which describe the task, as in many classical ML methods, but of feedback from the environment (payoff) which judges the correctness/usefulness of the last decisions/actions.

2. There is no strict distinction between learning and working, as there is, for instance, in many ANN approaches.

3. Since learning is done in an online process, Michigan classifier systems can adapt to changing circumstances in the environment.
7.2
Holland Classiﬁer Systems
For illustrating in more detail how such systems basically work, let us ﬁrst consider a common variant—the socalled Holland classiﬁer system.
A Holland classifier system is a classifier system of the Michigan type which processes binary messages of a fixed length through a rule base whose rules are adapted according to the responses of the environment [11, 16, 17].
7.2.1
The Production System
First of all, the communication of the production system with the environment is done via an arbitrarily long list of messages. The detectors translate responses from the environment into binary messages and place them on the message list, which is then scanned and changed by the rule base. Finally, the effectors translate output messages into actions on the environment, such as forces or movements.

Messages are binary strings of the same length k; more formally, a message belongs to {0, 1}^k. The rule base consists of a fixed number m of rules (classifiers), each of which consists of a fixed number r of conditions and an action, where both conditions and actions are strings of length k over the alphabet {0, 1, ∗}. The asterisk plays the role of a wildcard, a "don't care" symbol. A condition is matched if and only if there is a message in the list which matches the condition in all non-wildcard positions. Moreover, conditions, except the first one, may be negated by adding a "–" prefix. Such a prefixed condition is satisfied if and only if there is no message in the list which matches the string associated with the condition. Finally, a rule fires if and only if all of its conditions are satisfied, i.e. the conditions are connected by AND. Such "firing" rules compete to put their action messages on the message list. This competition will be discussed shortly in connection with the apportionment of credit problem.

In the action parts, the wildcard symbols have a different meaning: they take the role of "pass through" elements. The output message of a firing rule whose action part contains a wildcard is composed from the non-wildcard positions of the action and the message which satisfies the first condition of the classifier. This is actually the reason why negations of the first condition are not allowed. More formally, the outgoing message m̃ is defined as

    m̃[i] = a[i]  if a[i] ≠ ∗
    m̃[i] = m[i]  if a[i] = ∗

for i = 1, . . . , k,
where a is the action part of the classiﬁer and m is the message which matches the ﬁrst condition. Formally, a classiﬁer is a string of the form Cond1 , [“–”]Cond2 , . . . , [“–”]Condr /Action,
where the brackets express the optionality of the "–" prefixes.
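The matching and "pass through" rules described above can be sketched in a few lines; the function names are illustrative, not part of any particular classifier system implementation:

```python
# Sketch of Holland-classifier condition matching and the "pass through"
# composition of output messages. Conditions and actions are strings over
# {0, 1, *}; messages are binary strings of the same length.

def matches(condition: str, message: str) -> bool:
    """A condition matches a message iff they agree on all non-wildcard positions."""
    return all(c == '*' or c == m for c, m in zip(condition, message))

def condition_satisfied(condition: str, negated: bool, message_list) -> bool:
    """A plain condition needs at least one matching message on the list;
    a '-'-prefixed condition is satisfied iff no message matches it."""
    any_match = any(matches(condition, msg) for msg in message_list)
    return not any_match if negated else any_match

def output_message(action: str, first_match: str) -> str:
    """Wildcards in the action part pass through the corresponding bit of
    the message that satisfied the classifier's first condition."""
    return ''.join(m if a == '*' else a for a, m in zip(action, first_match))
```

For example, `output_message("1*0*", "0110")` yields `"1100"`: the non-wildcard positions come from the action, the wildcard positions from the matched message.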
Depending on the concrete needs of the task to be solved, it may be desirable to allow messages to be preserved for the next step. More specifically, if a message is not interpreted and removed by the effector interface, it can make another classifier fire in the next step. In practical applications, this is usually accomplished by reserving a few bits of the messages for identifying the origin of the messages (a kind of variable index called a tag). Tagging offers new opportunities to transfer information about the current step into the next step simply by placing tagged messages on the list which are not interpreted by the output interface. These messages, which obviously contain information about the previous step, can support the decisions in the next step. Hence, appropriate use of tags permits rules to be coupled to act sequentially. In some sense, such messages are the memory of the system.

A single execution cycle of the production system consists of the following steps:

1. Messages from the environment are appended to the message list.

2. All the conditions of all classifiers are checked against the message list to obtain the set of firing rules.

3. The message list is erased.

4. The firing classifiers participate in a competition to place their messages on the list (see 7.2.2).

5. The winning classifiers place their actions on the list.

6. The messages directed to the effectors are executed.

This procedure is repeated iteratively. How step 6 is carried out, whether these messages are deleted or not, and so on, depends on the concrete implementation. It is, on the one hand, possible to choose a representation such that each output message can be interpreted by the effectors. On the other hand, it is possible to direct messages explicitly to the effectors with a special tag. If no messages are directed to the effectors, the system is in a thinking phase.
A classiﬁer R1 is called consumer of a classiﬁer R2 if and only if there is a message m which fulﬁlls at least one of R1 ’s conditions and has been placed on the list by R2 . Conversely, R2 is called a supplier of R1 .
7.2.2
The Bucket Brigade Algorithm
As already mentioned, in each time step t, we assign a strength value ui,t to each classifier Ri. This strength value represents the correctness and importance of a classifier. On the one hand, the strength value influences the chance of a classifier to place its action on the output list. On the other hand, the strength values are used by the rule discovery system, which we will discuss shortly.

In Holland classifier systems, the adaptation of the strength values depending on the feedback (payoff) from the environment is done by the so-called bucket brigade algorithm. It can be regarded as a simulated economic system in which various agents, here the classifiers, participate in an auction, where the chance to buy the right to post an action depends on the strength of the agents. The bid of classifier Ri at time t is defined as

    Bi,t = cL · ui,t · si,

where cL ∈ [0, 1] is a learning parameter, similar to learning rates in artificial neural nets, and si is the specificity, i.e. the number of non-wildcard symbols in the condition part of the classifier. If cL is chosen small, the system adapts slowly; if it is chosen too high, the strengths tend to oscillate chaotically.

Then the rules have to compete for the right to place their output messages on the list. In the simplest case, this can be done by a random experiment like the selection in a genetic algorithm: for each bidding classifier, it is decided randomly whether it wins or not, where the probability of winning is proportional to its bid:

    P[Ri wins] = Bi,t / Σ_{j∈Sat_t} Bj,t.

In this equation, Sat_t is the set of indices of all classifiers which are satisfied at time t. Classifiers which get the right to post their output messages are called winning classifiers. Obviously, in this approach, more than one winning classifier is allowed. Of course, other selection schemes are reasonable; for instance, the highest bidding agent may win alone. This can be necessary to avoid that two winning classifiers direct conflicting actions to the effectors.

Now let us discuss how payoff from the environment is distributed and how the strengths are adapted. For this purpose, let us denote the set of
classifiers which have supplied a winning agent Ri in step t with Si,t. Then the new strength of a winning agent is reduced by its bid and increased by its portion of the payoff Pt received from the environment:

    ui,t+1 = ui,t + Pt/wt − Bi,t,
where wt is the number of winning agents in the actual time step. A winning agent pays its bid to its suppliers, which share the bid among each other, equally in the simplest case:

    ul,t+1 = ul,t + Bi,t / |Si,t|   for all Rl ∈ Si,t.
If a winning agent has also been active in the previous step and supplies another winning agent, the value above is additionally increased by one portion of the bid the consumer offers. In the case that two winning agents have supplied each other mutually, the portions of the bids are exchanged in the above manner. The strengths of all other classifiers Rn, which are neither winning agents nor suppliers of winning agents, are reduced by a certain factor (they pay a tax):

    un,t+1 = un,t · (1 − T),

where T is a small value from [0, 1]. The intention of taxation is to punish classifiers which never contribute anything to the output of the system. With this concept, redundant classifiers, which never become active, can be filtered out.

The idea behind credit assignment in general, and the bucket brigade in particular, is to increase the strengths of rules which have set the stage for later successful actions. The problem of determining such classifiers, i.e. those which were responsible for conditions under which it was later possible to receive a high payoff, can be very difficult. Consider, for instance, the game of chess again, in which very early moves can be significant for a late success or failure. In fact, the bucket brigade algorithm can solve this problem, although strength is only transferred to the suppliers which were active in the previous step: each time the same sequence is activated, a little bit of the payoff is transferred one step back in the sequence. It is easy to see that repeated successful execution of a sequence increases the strengths of all involved classifiers.

Figure 7.2 shows a simple example of how the bucket brigade algorithm works. For simplicity, we consider a sequence of five classifiers which always
Figure 7.2: The bucket brigade principle.
bid 20 percent of their strength. Only after the fifth step, i.e. after the activation of the fifth classifier, a payoff of 60 is received. The further development of the strengths in this example is shown in the table in Figure 7.3. It is easy to see from this example that the reinforcement of the strengths is slow at the beginning, but accelerates later. Exactly this property contributes much to the robustness of classifier systems—they tend to be cautious at the beginning, trying not to jump to conclusions, but, after a certain number of similar situations, the system relies on the rules more and more. Figure 7.4 visualizes this fact by interpreting the table shown in Figure 7.3 as a two-dimensional surface.

It should be clear that a Holland classifier system only works if successful sequences of classifier activations are observed sufficiently often. Otherwise the bucket brigade algorithm has no chance to reinforce the strengths of the successful sequence properly.
Strength after the nth execution of the sequence:

    execution   classifier 1   classifier 2   classifier 3   classifier 4   classifier 5
    3rd         100.00         100.00         101.60         120.80         172.00
    4th         100.00         100.32         105.44         136.16         197.60
    5th         100.06         101.34         111.58         152.54         234.46
    6th         100.32         103.39         119.78         168.93         247.57
    10th        106.56         124.17         164.44         224.84         278.52
    25th        215.86         253.20         280.36         294.52         299.24

Figure 7.3: An example for repeated propagation of payoffs.
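The dynamics above can be simulated directly. The sketch below assumes a fixed chain of classifiers that always fire in sequence (the setting of Figure 7.2): each classifier bids cL = 0.2 times its strength, pays the bid, receives the bid of its consumer (the next classifier in the chain), and the last one receives the payoff of 60. The simultaneous update order is one consistent reading of the text, not the only possible one:

```python
# Bucket brigade on a fixed chain of classifiers firing in sequence.
# Each classifier pays its bid; all but the last receive the bid of
# their consumer; the last one receives the external payoff.

def run_chain(strengths, c_l=0.2, payoff=60.0, executions=1):
    u = list(strengths)
    n = len(u)
    for _ in range(executions):
        bids = [c_l * ui for ui in u]
        new_u = []
        for i in range(n):
            ui = u[i] - bids[i]          # every winner pays its bid
            if i < n - 1:
                ui += bids[i + 1]        # ... and receives its consumer's bid
            else:
                ui += payoff             # the last classifier gets the payoff
            new_u.append(ui)
        u = new_u
    return u

print(run_chain([100.0] * 5, executions=1))  # [100.0, 100.0, 100.0, 100.0, 140.0]
print(run_chain([100.0] * 5, executions=2))  # [100.0, 100.0, 100.0, 108.0, 172.0]
```

The first two executions reproduce the strengths of Figure 7.2, and repeated execution drives every strength towards the equilibrium 60/0.2 = 300, which matches the trend visible in Figure 7.3.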
7.2.3
Rule Generation
While the apportionment of credit system just judges the rules, the purpose of the rule discovery system is to eliminate badly performing rules and to replace them with hopefully better ones. The fitness of a rule is simply its strength. Since the classifiers of a Holland classifier system are themselves strings, the application of a genetic algorithm to the problem of rule induction is straightforward, though many variants are reasonable. Almost all variants have in common that the GA is not invoked in each time step, but only every n-th step, where n has to be set such that enough information about the performance of new classifiers can be obtained in the meantime. A. Geyer-Schulz [11], for instance, suggests the following procedure, where the strength of new classifiers is initialized with the average strength of the current rule base:

1. Select a subpopulation of a certain size at random.

2. Compute a new set of rules by applying the genetic operations selection, crossover, and mutation to this subpopulation.

3. Merge the new subpopulation with the rule base, omitting duplicates and replacing the worst classifiers.

This process of acquiring new rules has an interesting side effect. It is more than just the exchange of parts of conditions and actions. Since we
Figure 7.4: A graphical representation of the table shown in Figure 7.3.
have not stated restrictions for manipulating tags, the genetic algorithm can recombine parts of already existing tags to invent new ones. Subsequently, tags spawn related tags, establishing new couplings. These new tags survive if they contribute to useful interactions. In this sense, the GA additionally creates experience-based internal structures autonomously.
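The rule-discovery step can be sketched as follows. The concrete genetic operators (one-point crossover, pointwise mutation) and the parameter values are illustrative assumptions, not prescribed in [11]; classifiers are kept as a dictionary mapping strings to strengths:

```python
import random

# Sketch of a rule-discovery step: evolve a random subpopulation with
# fitness-proportional selection and merge the offspring back, replacing
# the worst classifiers. New classifiers start with the average strength.

def crossover(a: str, b: str) -> str:
    point = random.randrange(1, len(a))        # one-point crossover
    return a[:point] + b[point:]

def mutate(c: str, rate: float = 0.02) -> str:
    alphabet = "01*"
    return ''.join(random.choice(alphabet) if random.random() < rate else ch
                   for ch in c)

def discovery_step(rule_base: dict, sub_size: int, n_offspring: int) -> dict:
    avg = sum(rule_base.values()) / len(rule_base)
    sub = random.sample(list(rule_base), sub_size)
    weights = [rule_base[c] for c in sub]      # strength = fitness
    offspring = set()
    while len(offspring) < n_offspring:
        p1, p2 = random.choices(sub, weights=weights, k=2)
        child = mutate(crossover(p1, p2))
        if child not in rule_base:             # omit duplicates
            offspring.add(child)
    # keep the best classifiers, replace the worst by the new ones
    survivors = sorted(rule_base, key=rule_base.get,
                       reverse=True)[:len(rule_base) - len(offspring)]
    new_base = {c: rule_base[c] for c in survivors}
    new_base.update({c: avg for c in offspring})
    return new_base
```

The rule base keeps its size; only the weakest classifiers are displaced, and duplicates of existing classifiers are discarded.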
7.3
Fuzzy Classiﬁer Systems of the Michigan Type
While classifier systems of the Michigan type were introduced by J. H. Holland as early as the Seventies, their fuzzification had to wait many years. The first fuzzy classifier system of the Michigan type was introduced by M. Valenzuela-Rendón [28, 29]. It is, more or less, a straightforward fuzzification of a Holland classifier system. An alternative approach has been developed by A. Bonarini [5, 6], who introduced a different scheme of competition between classifiers. These two approaches have in common that they operate only on the fuzzy rules—the shapes of the membership functions are fixed. A third method, which was introduced by A. Parodi and P. Bonelli [24], tries to optimize even the membership functions and the output weights according to the payoff from the environment.
7.3.1
Directly Fuzzifying Holland Classiﬁer Systems
The Production System

We consider a fuzzy controller with real-valued inputs and outputs. The system has, unlike ordinary fuzzy controllers, three different types of variables—input, output, and internal variables. As we will see later, internal variables serve the purpose of storing information about the near past. They correspond to the internally tagged messages in Holland classifier systems. For the sake of generality and simplicity, all domains of all variables are intervals transformed to the unit interval [0, 1].

For each variable, the same number n of membership functions is assumed. These membership functions are fixed at the beginning; they are not changed throughout the learning process. M. Valenzuela-Rendón took bell-shaped functions dividing the input domain equally.

A message is a binary string of length l + n, where n is the number of membership functions defined above and l is the length of the prefix (tag) which identifies the variable to which the message belongs. A perfect choice for l would be ⌈log2 K⌉, where K is the total number of variables we want to consider. To each message, an activity level, which represents a truth value, is assigned. Consider, for instance, the following message (l = 3, n = 5):

    010 : 00010 → 0.6
Its meaning is "input value no. 2 (since 010 encodes 2) belongs to fuzzy set no. 4 with a degree of 0.6". On the message list, only so-called minimal messages are used, i.e. messages with only one 1 in the right part, which corresponds to the indices of the fuzzy sets.

Classifiers again consist of a fixed number r of conditions and an action part. Note that, in this approach, no wildcards and no "–" prefixes are used. Both condition and action parts are also binary strings of length l + n, where the tag and the identifiers of the fuzzy sets are separated by a colon. The degree to which such a condition is matched is a truth value between 0 and 1: it is computed as the maximal activity of messages on the list which have the same tag and whose 1s are a subset of those of the condition. Figure 7.5 shows a simple example of how this matching is done. The degree of satisfaction of the whole classifier is then computed as the minimum of the matching degrees of the conditions. This value is used as the activity level which is assigned to the output message (this corresponds to Mamdani inference).
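The computation of matching degrees can be sketched as follows; the tuple representation of messages as (tag, bits, activity) triples is an assumption for illustration:

```python
# Fuzzy condition matching: a condition's matching degree is the maximal
# activity over same-tag messages whose 1-bits form a subset of the
# condition's 1-bits; a whole classifier is satisfied to the minimum
# degree of its conditions (Mamdani-style AND).

def match_degree(cond_tag, cond_bits, message_list):
    """message_list holds (tag, bits, activity) triples, bits over {0,1}."""
    degrees = [activity
               for tag, bits, activity in message_list
               if tag == cond_tag
               and all(c == '1' for c, b in zip(cond_bits, bits) if b == '1')]
    return max(degrees, default=0.0)

def classifier_degree(conditions, message_list):
    """conditions is a list of (tag, bits) pairs."""
    return min(match_degree(t, b, message_list) for t, b in conditions)
```

For instance, with messages 010:01000 → 0.3 and 010:00100 → 0.8 on the list, the condition 010:01100 is matched to degree max(0.3, 0.8) = 0.8, as in Figure 7.5.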
Figure 7.5: Matching a fuzzy condition.

The whole rule base consists of a fixed number m of such classifiers. Similarly to Holland classifier systems, one execution step of the production system is carried out as follows:

1. The detectors receive crisp input values from the environment and translate them into minimal messages which are then added to the message list.

2. The degrees of matching are computed for all classifiers.

3. The message list is erased.

4. The output messages of some matched classifiers (see below) are placed on the message list.

5. The output messages are translated into minimal messages. For instance, the message 010 : 00110 → 0.9 is split up into the two messages 010 : 00010 → 0.9 and 010 : 00100 → 0.9.

6. The effectors discard the output messages (those referring to output variables) from the list and translate them into instructions to the environment.

Step 6 is done by a slightly modified Mamdani inference: the sum (instead of the maximum or another t-conorm) of the activity levels of messages which refer to the same fuzzy set of a variable is computed. The membership functions are then scaled with these sums. Finally, the center of gravity of the "union" (i.e. maximum) of these scaled functions belonging to one variable is computed (Sum-Prod inference).
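Step 6 can be sketched as follows. The triangular membership functions and the sampling-based center of gravity are illustrative assumptions (Valenzuela-Rendón uses fixed bell-shaped functions and the computation need not be grid-based):

```python
# Sketch of the modified Mamdani (Sum-Prod) inference: activity levels
# referring to the same fuzzy set are assumed to be summed already; the
# membership functions are scaled by these sums, their pointwise maximum
# is taken, and the center of gravity of the result is returned.

def triangle(center, width):
    """Illustrative triangular membership function on [0, 1]."""
    return lambda x: max(0.0, 1.0 - abs(x - center) / width)

def defuzzify(activities, mfs, samples=1001):
    """activities[i] is the summed activity for fuzzy set i; mfs[i] its
    membership function. Returns the center of gravity on [0, 1]."""
    num = den = 0.0
    for k in range(samples):
        x = k / (samples - 1)
        mu = max(a * mf(x) for a, mf in zip(activities, mfs))  # union = max
        num += x * mu
        den += mu
    return num / den if den > 0 else 0.5
```

With two triangles centered at 0.25 and 0.75, full activity on only the first set yields a crisp output near 0.25, and equal activities yield 0.5 by symmetry.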
Credit Assignment
Since fuzzy systems are designed to model smooth transitions, a probabilistic auction process as discussed in connection with Holland classifier systems, where only a small number of rules is allowed to fire, is not desirable. Of course, we assign strength values to the classifiers again. If we are dealing with a one-stage system, in which payoff for a certain action is received immediately and no long-term strategies must be evolved, it suffices to allow all matched rules to post their outputs and to share the payoff among the rules which were active in the last step, according to their activity levels in this step. More specifically, if St is the set of classifiers which have been active at time t, and Pt is the payoff received after the t-th step, the modification of the strengths of firing rules can be defined as

    ui,t+1 = ui,t + Pt · ai,t / Σ_{Rj∈St} aj,t   for all Ri ∈ St,   (7.1)
where ai,t denotes the activity level of the classifier Ri at time t. It is again possible to reduce the strength of inactive classifiers by a certain tax.

In the case that the problem is so complex that long-term strategies are indispensable, a fuzzification of the bucket brigade mechanism must be found. While Valenzuela-Rendón only provides a few vague ideas, we state one possible variant, where the firing rules pay a certain value to their suppliers depending on the activity level. The strength of a classifier which has been active in time step t is increased by a portion of the payoff as defined in (7.1), but it is additionally decreased by a value

    Bi,t = cL · ui,t · ai,t,

where cL ∈ [0, 1] is the learning parameter. Of course, it is again possible to incorporate terms which depend on the specificity of the classifier. This "fuzzy bid" is then shared among the suppliers of such a firing classifier according to the amount they have contributed to the matching of the consumer. If we consider an arbitrary but fixed classifier Rj, which has been active in step t, and if we denote the set of classifiers supplying Rj, which have been active in step t − 1, with Sj,t, the change of the strengths of these suppliers can be defined as

    uk,t+1 = uk,t + Bj,t · ak,t−1 / Σ_{Rl∈Sj,t} al,t−1   for all Rk ∈ Sj,t.
Rule Discovery
The adaptation of a genetic algorithm to the problem of manipulating classifiers in our system is again straightforward. We only have to take special care that tags in condition parts must not refer to output variables and that tags in the action parts of the classifiers must not refer to input variables of the system. Analogously to our previous considerations, if we admit a certain number of internal variables, the system can build up internal chains automatically. By means of internal variables, a classifier system of this type does not only learn simple input-output actions; it also tries to discover causal interrelations.
7.3.2
Bonarini’s ELF Method
In [5], A. Bonarini presents his ELF (evolutionary learning of fuzzy rules) method and applies it to the problem of guiding an autonomous robot. The key issue of ELF is to find a small rule base which contains only important rules. While he takes over many of M. Valenzuela-Rendón's ideas, his way of modifying the rule base differs strongly from Valenzuela-Rendón's straightforward fuzzification of Holland classifier systems. Bonarini calls his modification scheme the "cover-detector algorithm". The number of rules can be varied in each time step depending on the number of rules which match the actual situation. This is done by two mutually exclusive operations:

1. If too many rules match the actual situation, the worst of them is deleted.

2. If too few rules match the current inputs, a new rule, whose antecedents cover the current state, is added to the rule base with a randomly chosen consequent value.

The genetic operations are only applied to the consequent values of the rules. Since the antecedents are generated on demand in the different time steps, no taxation is necessary.

Obviously, such a simple modification scheme can only be applied to so-called one-stage problems, where the effect of each rule can be observed in the next time step. For applications where this is not the case, e.g. backing up a truck, Bonarini introduced an additional concept to his ELF algorithm—the
notion of an episode, which is a given number of subsequent control actions, after which the reached state is evaluated (for details, see [6]).
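The cover-detector idea can be sketched as follows. The crisp matching predicate and the rule format are simplifying assumptions (ELF actually matches fuzzy state descriptions); only the two mutually exclusive operations are reproduced:

```python
import random

# Sketch of Bonarini's cover-detector step: keep the number of rules that
# match the current state within [min_match, max_match], either deleting
# the worst matching rule or covering the state with a random consequent.

def cover_detector_step(rules, state, min_match, max_match, n_consequents):
    """rules: list of dicts with 'antecedent', 'consequent', 'strength'."""
    matching = [r for r in rules if r['antecedent'] == state]
    if len(matching) > max_match:
        worst = min(matching, key=lambda r: r['strength'])
        rules.remove(worst)                      # too many rules: drop the worst
    elif len(matching) < min_match:
        rules.append({'antecedent': state,       # too few rules: cover the state
                      'consequent': random.randrange(n_consequents),
                      'strength': 0.0})
    return rules
```

The genetic operations would then act only on the `consequent` values, as described above.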
7.3.3
Online Modiﬁcation of the Whole Knowledge Base
While the last two methods only manipulate rules and work with fixed membership functions, there is at least one variant of fuzzy classifier systems where the shapes of the membership functions are involved in the learning process as well. This variant was introduced by A. Parodi and P. Bonelli in [24]. Let us restrict ourselves to the very basic idea here: a rule is not encoded with indices pointing to membership functions of a given shape. Instead, each rule contains codings of fuzzy sets like the ones we discussed in 5.1.
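The basic idea can be sketched as follows: each rule carries the parameters of its own fuzzy sets (here, illustratively, triangular ones), so the genetic operators can tune shapes along with the rule structure. This is a sketch under assumptions, not Parodi and Bonelli's actual encoding:

```python
from dataclasses import dataclass

# Instead of an index into a fixed family of membership functions, each
# rule carries the parameters of its own fuzzy sets, so the GA can modify
# the shapes as well. Triangular sets are an illustrative assumption.

@dataclass
class FuzzySet:
    left: float    # start of the support
    peak: float    # point of full membership
    right: float   # end of the support

    def mu(self, x: float) -> float:
        # degree of membership; assumes left < peak < right
        if x <= self.left or x >= self.right:
            return 0.0
        if x <= self.peak:
            return (x - self.left) / (self.peak - self.left)
        return (self.right - x) / (self.right - self.peak)

@dataclass
class Rule:
    antecedents: list   # one FuzzySet per input variable
    consequent: FuzzySet
    weight: float       # output weight, also subject to learning
```

Mutation can then perturb `left`, `peak`, `right`, or `weight` directly, and crossover can exchange whole fuzzy sets between rules.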
Bibliography
[1] Bagley, J. D. The Behavior of Adaptive Systems Which Employ Genetic and Correlative Algorithms. PhD thesis, University of Michigan, Ann Arbor, 1967.

[2] Bauer, P., Bodenhofer, U., and Klement, E. P. A fuzzy algorithm for pixel classification based on the discrepancy norm. In Proc. FUZZ-IEEE'96 (1996), vol. III, pp. 2007–2012.

[3] Bodenhofer, U. Tuning of fuzzy systems using genetic algorithms. Master's thesis, Johannes Kepler Universität Linz, March 1996.

[4] Bodenhofer, U., and Herrera, F. Ten lectures on genetic fuzzy systems. In Preprints of the International Summer School: Advanced Control—Fuzzy, Neural, Genetic, R. Mesiar, Ed. Slovak Technical University, Bratislava, 1997, pp. 1–69.

[5] Bonarini, A. ELF: Learning incomplete fuzzy rule sets for an autonomous robot. In Proc. EUFIT'93 (1993), vol. I, pp. 69–75.

[6] Bonarini, A. Evolutionary learning of fuzzy rules: Competition and cooperation. In Fuzzy Modeling: Paradigms and Practice, W. Pedrycz, Ed. Kluwer Academic Publishers, Dordrecht, 1996, pp. 265–283.

[7] Bulirsch, R., and Stoer, J. Introduction to Numerical Analysis. Springer, Berlin, 1980.

[8] Chen, C. L., and Chang, M. H. An enhanced genetic algorithm. In Proc. EUFIT'93 (1993), vol. II, pp. 1105–1109.

[9] Darwin, C. R. On the Origin of Species by means of Natural Selection and The Descent of Man and Selection in Relation to Sex, third ed., vol. 49 of Great Books of the Western World, Editor in chief: M. J. Adler. Robert P. Gwinn, Chicago, IL, 1991. First edition: John Murray, London, 1859.
[10] Engesser, H., Ed. Duden Informatik: Ein Sachlexikon für Studium und Praxis, second ed. Brockhaus, Mannheim, 1993.

[11] Geyer-Schulz, A. Fuzzy Rule-Based Expert Systems and Genetic Machine Learning, vol. 3 of Studies in Fuzziness. Physica Verlag, Heidelberg, 1995.

[12] Geyer-Schulz, A. The MIT beer distribution game revisited: Genetic machine learning and managerial behavior in a dynamic decision making experiment. In Genetic Algorithms and Soft Computing, F. Herrera and J. L. Verdegay, Eds. Physica Verlag, Heidelberg, 1996, pp. 658–682.

[13] Goldberg, D. E. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading, MA, 1989.

[14] Goldberg, D. E., Korb, B., and Deb, K. Messy genetic algorithms: Motivation, analysis, and first results. Complex Systems 3 (1989), 493–530.

[15] Hoffmann, F. Entwurf von Fuzzy-Reglern mit Genetischen Algorithmen. Deutscher Universitäts-Verlag, Wiesbaden, 1997.

[16] Holland, J. H. Adaptation in Natural and Artificial Systems, first MIT Press ed. The MIT Press, Cambridge, MA, 1992. First edition: University of Michigan Press, 1975.

[17] Holland, J. H., Holyoak, K. J., Nisbett, R. E., and Thagard, P. R. Induction: Processes of Inference, Learning, and Discovery. Computational Models of Cognition and Perception. The MIT Press, Cambridge, MA, 1986.

[18] Hörner, H. A C++ class library for genetic programming: The Vienna University of Economics genetic programming kernel. Tech. rep., Vienna University of Economics, May 1996.

[19] Koza, J. R. Genetic Programming: On the Programming of Computers by Means of Natural Selection. The MIT Press, Cambridge, MA, 1992.

[20] Kruse, R., Gebhardt, J., and Klawonn, F. Fuzzy-Systeme. B. G. Teubner, Stuttgart, 1993.

[21] Kruse, R., Gebhardt, J., and Klawonn, F. Foundations of Fuzzy Systems. John Wiley & Sons, New York, 1994.
[22] Neunzert, H., and Wetton, B. Pattern recognition using measure space metrics. Tech. Rep. 28, Universität Kaiserslautern, Fachbereich Mathematik, November 1987.

[23] Otten, R. H. J. M., and van Ginneken, L. P. P. P. The Annealing Algorithm. Kluwer Academic Publishers, Boston, 1989.

[24] Parodi, A., and Bonelli, P. A new approach to fuzzy classifier systems. In Proc. ICGA'93 (Los Altos, CA, 1993), S. Forrest, Ed., Morgan Kaufmann, pp. 223–230.

[25] Rechenberg, I. Evolutionsstrategie, vol. 15 of Problemata. Friedrich Frommann Verlag (Günther Holzboog KG), Stuttgart, 1973.

[26] Rumelhart, D. E., and McClelland, J. L. Parallel Distributed Processing—Explorations in the Microstructure of Cognition, Volume I: Foundations. MIT Press, Cambridge, MA, 1986.

[27] Tilli, T. Automatisierung mit Fuzzy-Logik. Franzis-Verlag, München, 1992.

[28] Valenzuela-Rendón, M. The fuzzy classifier system: A classifier system for continuously varying variables. In Proc. ICGA'91 (San Mateo, CA, 1991), R. K. Belew and L. B. Booker, Eds., Morgan Kaufmann, pp. 346–353.

[29] Valenzuela-Rendón, M. The fuzzy classifier system: Motivations and first results. In Parallel Problem Solving from Nature, H.-P. Schwefel and R. Männer, Eds. Springer, Berlin, 1991, pp. 330–334.

[30] van Laarhoven, P. J. M., and Aarts, E. H. L. Simulated Annealing: Theory and Applications. Kluwer Academic Publishers, Dordrecht, 1987.

[31] Zimmermann, H.-J. Fuzzy Set Theory—and its Applications, second ed. Kluwer Academic Publishers, Boston, 1991.

[32] Zurada, J. M. Introduction to Artificial Neural Networks. West Publishing, St. Paul, 1992.