Genetic Algorithm
Genetic Algorithms
No polynomial-time algorithms are known for many real-world optimization problems,
which makes them hard to solve. A number of heuristics have been designed for these
hard problems. Such heuristics may provide sub-optimal but acceptable solutions in
reasonable computational time. Meta-heuristics derived from natural physical and
biological phenomena, such as simulated annealing, evolutionary algorithms and
artificial neural networks, have also been used to solve these problems.
Genetic Algorithms (GAs) are adaptive procedures derived from Darwin's principle of
survival of the fittest in natural genetics. A GA maintains a population of potential
solutions to the candidate problem, termed individuals. By manipulating these
individuals through genetic operators such as selection, crossover and mutation, a GA
evolves towards better solutions over a number of generations. The implementation of a
genetic algorithm is shown as a flowchart in Figure 1.
[Figure 1: Flowchart of a genetic algorithm]
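The overall loop can be sketched in Python as follows. This is a minimal illustration under
stated assumptions, not the implementation described in this text: the toy fitness `one_max`
(the number of 1 bits), binary tournament selection, the parameter values and all function
names are chosen here only for brevity.

```python
import random

def one_max(bits):
    # Assumed toy fitness: number of 1 bits in the chromosome.
    return sum(bits)

def run_ga(fitness, n_bits=20, pop_size=30, generations=40,
           p_crossover=0.9, p_mutation=0.02, seed=0):
    rng = random.Random(seed)
    # Randomly created initial population of binary chromosomes.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        next_pop = []
        while len(next_pop) < pop_size:
            # Selection: binary tournament (an assumption; other methods fit here too).
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            c1, c2 = p1[:], p2[:]
            # Single-point crossover with probability p_crossover.
            if rng.random() < p_crossover:
                cut = rng.randint(1, n_bits - 1)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            # Bit-flip mutation applied to every child.
            for child in (c1, c2):
                for i in range(n_bits):
                    if rng.random() < p_mutation:
                        child[i] = 1 - child[i]
                next_pop.append(child)
        pop = next_pop[:pop_size]
        # Remember the best individual seen in any generation.
        best = max(pop + [best], key=fitness)
    return best

best = run_ga(one_max)
print(best, one_max(best))
```

Fixing the seed makes the run repeatable; a real application would replace `one_max` with a
fitness derived from its own objective function.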
Genetic Algorithm and its Application in Data Mining
Genetic algorithms start with a randomly created initial population of individuals, which
requires encoding every variable. A string of encoded variables makes up a chromosome, or
individual. When genetic algorithms were first implemented in the early 1970s, they were
applied to continuous optimization problems with binary coding of variables; in numerical
problems the binary strings are mapped to real numbers. Later, GAs were used to solve many
combinatorial optimization problems such as the 0/1 knapsack problem, the travelling
salesperson problem and scheduling problems. Binary coding has not been found suitable for
many of these problems, so codings other than binary have also been used. Continuous
function optimization uses real-number coding. Problems such as the travelling salesperson
problem and graph coloring use permutation coding. Genetic programming applications use
tree coding.
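As a sketch of how a binary-coded variable can be mapped to a real number, the function below
uses a linear mapping onto an interval. The name `decode_binary`, the bounds `lo` and `hi`, and
the most-significant-bit-first convention are assumptions for illustration; the text does not
specify a particular mapping.

```python
def decode_binary(bits, lo, hi):
    """Map a list of 0/1 bits (most significant first) linearly onto [lo, hi]."""
    as_int = int("".join(str(b) for b in bits), 2)
    max_int = 2 ** len(bits) - 1
    return lo + (hi - lo) * as_int / max_int

print(decode_binary([0, 0, 0, 0], -1.0, 1.0))  # -1.0
print(decode_binary([1, 1, 1, 1], -1.0, 1.0))  # 1.0
```

Longer bit strings give a finer resolution over the same interval.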
GAs use a fitness function derived from the objective function of the optimization problem
to evaluate the individuals in a population. The fitness function is the measure of an
individual's fitness, which is used to select individuals for reproduction. Many
real-world problems do not have a well-defined objective function and require the user
to define a fitness function.
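For instance, for the 0/1 knapsack problem mentioned earlier, a fitness function can be derived
from the objective (total value) roughly as sketched below. Returning zero for overweight
selections is one simple, assumed way of handling the capacity constraint, and the item values,
weights and capacity are made-up illustration data.

```python
def knapsack_fitness(bits, values, weights, capacity):
    # bits[i] == 1 means item i is packed.
    total_value = sum(v for b, v in zip(bits, values) if b)
    total_weight = sum(w for b, w in zip(bits, weights) if b)
    # Assumed constraint handling: infeasible selections get fitness 0.
    return total_value if total_weight <= capacity else 0

values = [10, 7, 4]
weights = [5, 4, 3]
print(knapsack_fitness([1, 0, 1], values, weights, capacity=8))  # 14
print(knapsack_fitness([1, 1, 0], values, weights, capacity=8))  # 0 (weight 9 > 8)
```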
The selection method in a GA selects parents from the population on the basis of the
fitness of individuals. Individuals with high fitness are selected with higher probability
to produce offspring for the next population. Fitness-proportionate selection assigns to
each individual x in the population at the current generation a probability P(x) that is
proportional to the fitness of x relative to the rest of the population.
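One common way to write this assignment is P(x) = f(x) / (sum of f(y) over all individuals y),
where f is the fitness function. That formula is the usual roulette-wheel form and is an
assumption here, since the text states it only in words. A minimal sketch, assuming
non-negative fitness values with a positive sum (function names are illustrative):

```python
import random

def selection_probabilities(fitnesses):
    # P(x) = f(x) / sum of all fitness values.
    total = sum(fitnesses)
    return [f / total for f in fitnesses]

def roulette_select(population, fitnesses, rng=random):
    # random.choices draws with probability proportional to the weights.
    return rng.choices(population, weights=fitnesses, k=1)[0]

fits = [1.0, 3.0, 4.0]
print(selection_probabilities(fits))  # [0.125, 0.375, 0.5]
```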
Fitness-proportionate selection is strongly biased towards the fit individuals in the
population and exerts high selection pressure. It causes premature convergence of the GA,
because after a few generations the population consists of highly fit individuals and no
fitness bias is left for the selection procedure to work with. Therefore, other selection
methods such as tournament selection and rank selection are used to avoid this bias.
Tournament selection compares two or more randomly selected individuals and selects the
better individual with a pre-specified probability. Rank selection calculates the
selection probability of individuals from their rank when the population is sorted by
increasing fitness.
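Both alternatives can be sketched briefly. The binary tournament size, the parameter `p_best`
and the linear rank-to-probability scheme (probability proportional to rank, with rank 1 for
the lowest fitness) are assumptions made for this illustration; the text does not fix these
details.

```python
import random

def tournament_select(population, fitness, p_best=0.8, rng=random):
    # Draw two distinct individuals at random and keep the better one
    # with probability p_best, otherwise the worse one.
    a, b = rng.sample(population, 2)
    better, worse = (a, b) if fitness(a) >= fitness(b) else (b, a)
    return better if rng.random() < p_best else worse

def rank_probabilities(fitnesses):
    # Rank 1 goes to the lowest fitness, rank n to the highest;
    # selection probability is proportional to rank (assumed linear scheme).
    n = len(fitnesses)
    order = sorted(range(n), key=lambda i: fitnesses[i])
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    total = n * (n + 1) // 2
    return [r / total for r in ranks]

# The very fit first individual gets probability 1/2 under ranking,
# whereas fitness-proportionate selection would give it 50/53.
print(rank_probabilities([50.0, 1.0, 2.0]))
```

Because only the ordering matters, a single outstanding individual cannot dominate selection
the way it does under fitness-proportionate selection.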
Winter School on "Data Mining Techniques and Tools for Knowledge Discovery in Agricultural Datasets
In a standard genetic algorithm, two parents are selected at a time and are used to create
two new children that take part in the next generation. The crossover operator is applied
to the pair with a pre-specified crossover probability. Single-point crossover is the most
common form of this operator. It marks a random crossover spot within the length of the
chromosome and exchanges the bits (in binary coding) to the right of the spot between the
two parents. For example, crossing 1 1 1 1 1 1 and 0 0 0 0 0 0 after the second bit gives
1 1 0 0 0 0 and 0 0 1 1 1 1.
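A minimal sketch of single-point crossover on list-based binary chromosomes follows. The
function name, the default crossover probability and the choice to copy the parents unchanged
when no crossover occurs are assumptions for illustration.

```python
import random

def single_point_crossover(parent1, parent2, p_crossover=0.9, rng=random):
    # With probability 1 - p_crossover the parents are copied unchanged.
    if rng.random() >= p_crossover:
        return parent1[:], parent2[:]
    # Pick a cut point strictly inside the chromosome and swap the tails.
    cut = rng.randint(1, len(parent1) - 1)
    child1 = parent1[:cut] + parent2[cut:]
    child2 = parent2[:cut] + parent1[cut:]
    return child1, child2

c1, c2 = single_point_crossover([0] * 8, [1] * 8, p_crossover=1.0)
print(c1, c2)
```

With all-zero and all-one parents, the printed children show directly where the random cut
fell: the first child is a run of zeros followed by ones, and the second is its complement.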
The mutation operator is applied to all children after crossover. It flips each bit in the
individual with a pre-specified mutation probability. An example of mutation is given
below, in which the fifth bit has been mutated.
Before mutation: 0 1 1 1 0 1 0 1 0 1 1 1
After mutation:  0 1 1 1 1 1 0 1 0 1 1 1
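Bit-flip mutation can be sketched as below. The function names and the default mutation
probability are assumptions; `flip_bit` is a hypothetical deterministic helper added only to
reproduce a fifth-bit flip on a twelve-bit string like the one in the example above.

```python
import random

def mutate(bits, p_mutation=0.01, rng=random):
    # Flip each bit independently with probability p_mutation.
    return [1 - b if rng.random() < p_mutation else b for b in bits]

def flip_bit(bits, position):
    # Deterministic helper: flip the bit at a 1-based position.
    out = bits[:]
    out[position - 1] = 1 - out[position - 1]
    return out

before = [0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1]
print(flip_bit(before, 5))  # [0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1]
```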
Data mining is used to analyze large datasets and to establish useful classifications
and patterns in them. Agricultural and biological research studies have used
various data mining techniques, including natural trees, statistical machine learning
and other analysis methods. Genetic algorithms have been widely used in data mining
applications such as classification, clustering and feature selection. Two applications of
GAs in data mining are described below.
S. C. Shah and A. Kusiak (2004) applied a genetic algorithm to feature selection for
mining SNPs in association studies. Genomic studies provide large volumes of data
with thousands of single nucleotide polymorphisms (SNPs). The analysis of SNPs
determines relationships between genotypic and phenotypic information and helps to
identify SNPs related to a disease. The authors developed an approach for predicting drug
effectiveness that is based on data mining and genetic algorithms. A global search
mechanism, a weighted decision tree, a decision-tree-based wrapper, a correlation-based
heuristic and the identification of intersecting feature sets with a genetic algorithm are
employed to select significant genes. The feature selection approach resulted in an
85% reduction in the number of features. The relative increases in cross-validation
accuracy and specificity for the selected significant gene/SNP set were 10% and 3.2%,
respectively. The approach was successfully applied to data sets for drug and placebo
subjects: the number of features was significantly reduced while the quality of
knowledge was enhanced.
References