GADataMining CNA
Sid Bhattacharyya
Overview
Genetic Algorithms: a gentle introduction
What are GAs?
How do they work? Why?
Critical issues
Natural Genetics to AI
Computational models inspired by biological evolution
survival of the fittest
reproduction through cross-breeding
Genetic Algorithms
Population based search (parallel)
simultaneous search from multiple points in search space
selection
Advantages of GAs
General purpose, robust search technique
application to varied problem types
Data mining
fitness function: flexible expression of modeling criteria, tradeoffs among multiple objectives
models optimized to specific business objectives
diverse model representation: linear, non-linear interaction terms, rules, sequences, etc.
GA Application Examples
Function optimizers: difficult, discontinuous, multi-modal, noisy functions
Combinatorial optimization: layout of VLSI circuits, factory scheduling, traveling salesman problem
Design and control: bridge structures, neural networks, communication networks design; control of chemical plants, pipelines
Machine learning: classification rules, economic modeling, scheduling strategies
Portfolio design, optimized trading models, direct marketing models, sequencing of TV advertisements, adaptive agents, data mining, etc.
Recombination
Crossover
Mutation
Offspring1(1,4) Offspring2(1,4) Offspring3(2,7) Offspring4(2,7) ... ... OffspringN(x,y)
Generation t
Generation t+1
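The cycle from generation t to generation t+1 (selection, crossover, mutation) can be sketched as a short loop. This is a minimal illustration only: the OneMax fitness (count of 1-bits), the parameter values, and the name `run_ga` are assumptions made for the example, not taken from the slides.

```python
import random

def fitness(bits):
    # OneMax: number of 1-bits (illustrative objective, an assumption)
    return sum(bits)

def run_ga(pop_size=20, n_bits=16, generations=30,
           p_cross=0.7, p_mut=0.01, seed=0):
    # pop_size is assumed to be even so parents can be paired off
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    history = []  # (best, average) fitness per generation
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]
        history.append((max(scores), sum(scores) / pop_size))
        # Selection: fitness-proportionate sampling with replacement
        weights = [s + 1e-9 for s in scores]  # avoid an all-zero wheel
        parents = rng.choices(pop, weights=weights, k=pop_size)
        offspring = []
        for a, b in zip(parents[0::2], parents[1::2]):
            a, b = a[:], b[:]
            if rng.random() < p_cross:
                # One-point crossover at a random site
                site = rng.randint(1, n_bits - 1)
                a, b = a[:site] + b[site:], b[:site] + a[site:]
            offspring += [a, b]
        # Mutation: flip each gene with small probability
        pop = [[1 - g if rng.random() < p_mut else g for g in ind]
               for ind in offspring]
    return max(pop, key=fitness), history

best, history = run_ga()
print(fitness(best), history[0], history[-1])
```

The `history` list holds the best and average fitness per generation, which is the pair of curves usually plotted for a GA run.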
Hill climber
Typical GA Run
[Figure: best and average fitness plotted against generations]
Operators: Selection
Fitness-proportionate selection: the expected number of reproductive trials for individual i is $f_i / \bar{f}$, its fitness relative to the population average
Selection
Roulette-wheel selection
(stochastic sampling with replacement)
wheel spaced in proportion to fitness values
N (pop size) spins of the wheel
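Roulette-wheel selection as described above can be sketched as follows. The function name `roulette_select` and the use of cumulative sums with binary search are choices made for this example; fitness values are assumed to be non-negative with a positive total.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def roulette_select(fitnesses, n, rng):
    """Return n selected indices, one per spin of the wheel."""
    cumulative = list(accumulate(fitnesses))  # slot boundaries on the wheel
    total = cumulative[-1]
    picks = []
    for _ in range(n):
        r = rng.random() * total              # where this spin lands
        idx = bisect_right(cumulative, r)     # first slot ending beyond r
        picks.append(min(idx, len(fitnesses) - 1))
    return picks

rng = random.Random(1)
print(roulette_select([0, 2, 0, 3], 10, rng))
```

Because each spin is independent, the actual number of copies an individual receives can differ noticeably from its expected number.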
Selection
Stochastic universal sampling
N equally spaced pins on wheel
single turn of the wheel
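A sketch of stochastic universal sampling, assuming non-negative fitness values with a positive total; the name `sus_select` is hypothetical. One random offset places all N pins, so the wheel is turned only once.

```python
import random

def sus_select(fitnesses, n, rng):
    """Return n selected indices using n equally spaced pins."""
    total = sum(fitnesses)
    step = total / n                 # distance between pins
    start = rng.random() * step      # single random turn of the wheel
    picks = []
    i, cumulative = 0, fitnesses[0]
    for k in range(n):
        pin = start + k * step
        # Advance to the slot that contains this pin
        while cumulative <= pin and i < len(fitnesses) - 1:
            i += 1
            cumulative += fitnesses[i]
        picks.append(i)
    return picks

print(sus_select([1, 1, 2], 4, random.Random(0)))
```

With fitnesses 1, 1, 2 and four pins, the individuals receive 1, 1 and 2 copies, matching their expected numbers of trials exactly; this reduced sampling variance is the usual reason for preferring it over repeated spins.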
Selection
Premature convergence
Fitness scaling
$f' = f - (2 f_{avg} - f_{max})$
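The scaling rule above can be checked with a few numbers. Subtracting (2*avg - max) from every fitness leaves the scaled maximum at exactly twice the scaled average, which limits how strongly the best individual dominates selection. The function name `scale_fitness` is an assumption for this sketch; note that scaled values can become negative for weak individuals, and how to handle that (for example by clipping at zero) is not specified in the slides.

```python
def scale_fitness(fitnesses):
    """Apply f' = f - (2*avg - max) to every fitness value."""
    avg = sum(fitnesses) / len(fitnesses)
    shift = 2 * avg - max(fitnesses)
    return [f - shift for f in fitnesses]

scaled = scale_fitness([2, 4, 4, 6])
print(scaled)                              # scaled fitness values
print(max(scaled), sum(scaled) / len(scaled))
```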
Operators: Crossover
Parent 1: 11010 101100101 Parent 2: xxyxx yxyyxxyxy
crossover site
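Single-site crossover on the two parent strings above can be sketched as below, assuming the crossover site falls after the fifth gene, as the spacing in the parent strings suggests. The function name `one_point_crossover` is illustrative.

```python
def one_point_crossover(p1, p2, site):
    """Swap the tails of two equal-length parents after position `site`."""
    child1 = p1[:site] + p2[site:]
    child2 = p2[:site] + p1[site:]
    return child1, child2

c1, c2 = one_point_crossover("11010101100101", "xxyxxyxyyxxyxy", 5)
print(c1)
print(c2)
```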
Crossover
Parent 1: axpsqvqbtpihd Parent 2: qzxxaycgbtphw
crossover sites
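With two crossover sites, the segment between the sites is exchanged. The slide does not say where the sites fall for these parents, so the positions 3 and 8 used here are purely hypothetical, as is the function name.

```python
def two_point_crossover(p1, p2, a, b):
    """Exchange the segment [a, b) between two equal-length parents."""
    child1 = p1[:a] + p2[a:b] + p1[b:]
    child2 = p2[:a] + p1[a:b] + p2[b:]
    return child1, child2

# Sites 3 and 8 are assumed for illustration only
c1, c2 = two_point_crossover("axpsqvqbtpihd", "qzxxaycgbtphw", 3, 8)
print(c1)
print(c2)
```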
Crossover
[Figure: fitness plotted against X, showing parents and offspring]
Operators: Mutation
alters each gene with small probability
Before: x1yx0y0yy0x yxy
After:  x1yx0y1yy0x xxy
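For a binary string, "alters each gene with small probability" is commonly implemented as an independent bit flip per gene. The sketch below assumes a list of 0/1 genes; the name `mutate` and the rate used in the example call are illustrative.

```python
import random

def mutate(bits, p_mut, rng):
    """Flip each gene independently with probability p_mut."""
    return [1 - b if rng.random() < p_mut else b for b in bits]

rng = random.Random(0)
print(mutate([1, 1, 0, 1, 0, 0, 1, 0], 0.1, rng))
```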
Recombination operators
Mutation & premature convergence
Mutation vs. crossover
operator probabilities
which is more important?
Non-Binary Representations
Integer, real-number, order-based, rules, ...
Binary or real-valued?
real representations give faster, more consistent, more accurate results
High-level representation
intuitive, can utilize specialized crossover and mutation
effective search over complex spaces
design of representation and operators: forma theory
Real-valued representation
Parent1:     3.45   0.56   6.78   0.976   2.5
Parent2:     0.98   1.06   4.20   0.34    1.8
Offspring1:  3.22   0.56   6.78   0.65    2.12
Offspring2:  1.43   1.06   4.20   0.41    1.93
(Arithmetic crossover)
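One common form of arithmetic crossover blends the two parents gene by gene with a weight alpha. The slide does not give the weights behind its offspring values (and some genes there appear to be copied rather than blended), so the single shared `alpha` below is an assumption, as is the function name.

```python
def arithmetic_crossover(p1, p2, alpha):
    """Blend two real-valued parents; alpha weights the first parent."""
    child1 = [alpha * x + (1 - alpha) * y for x, y in zip(p1, p2)]
    child2 = [(1 - alpha) * x + alpha * y for x, y in zip(p1, p2)]
    return child1, child2

p1 = [3.45, 0.56, 6.78, 0.976, 2.5]
p2 = [0.98, 1.06, 4.20, 0.34, 1.8]
c1, c2 = arithmetic_crossover(p1, p2, 0.5)
print(c1)
print(c2)
```

Each offspring gene lies between the two parent values, and the sum of the two offspring genes equals the sum of the two parent genes at every position.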
High-level representation
Parent1: {(1.2 ≤ x1 ≤ 3.4) (5.8 ≤ x2 ≤ 6.0) (0.2 ≤ x7 ≤ 0.61)}
Parent2: {(2.3 ≤ x6 ≤ 4.1) (3.6 ≤ x2 ≤ 5.1) (5.1 ≤ x4 ≤ 5.61)}
Offspring1: {(1.2 ≤ x1 ≤ 3.4) (2.2 ≤ x9 ≤ 2.7) (5.1 ≤ x4 ≤ 5.61)}
Offspring2: {(2.3 ≤ x6 ≤ 4.1) [(3.6 ≤ x2 ≤ 5.1) (5.8 ≤ x2 ≤ 6.0)] (0.3 ≤ x3 ≤ 1.1) (0.2 ≤ x7 ≤ 0.61)}
High-level representation
Generalize/Specialize
{(0.3 ≤ x3 ≤ 1.1) (2.2 ≤ x9 ≤ 2.7)}
{(0.3 ≤ x3 ≤ 1.1) (2.2 ≤ x9 ≤ 2.7) (5.1 ≤ x4 ≤ 6.2)}
[Figure: expression tree for (x log(y))/5]
If (y<7) and (x>2) then 0, else 2x+y
Exploration vs. exploitation of search
Does not guarantee optimality! But ...
Robust, assumption-free, and very general
Structured population models
Parallelizable for large data
Hybrid approaches: GAs with conventional optimization techniques
Using GAs?
When to use a GA?
GA and traditional techniques
How long does it take?
Will it perform better?
Using GAs
population size
mutation, crossover rates
how many generations
multiple runs
Is it a black box?
Data characteristics
Fitness function
GA parameters
Data Mining
Pattern templates
([attribute in {v1,v2}] and [attribute=value]) or
([attribute in {v1,v2,v3}] and [attribute>value]) or
when S, if C then P
when region=ne, if inc > 41K and child > 2 then x-sales > 100
(C: inc > 41K and child > 2; P: x-sales > 100)
when S, C and P are positively correlated
the mean of A when S and C is significantly different from the mean of A when S
Data mining
How good are the patterns?
accuracy = (# cases in C and P) / (# cases in C)
coverage = (# cases in C and P) / (# cases in P)
support = (# cases in C) / (# cases in S)
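The three pattern-quality measures can be computed directly from case counts. The sketch below evaluates a rule of the form "when S, if C then P" on a handful of made-up records; the data, the field names, and the choice to count C and P only among cases satisfying S are all assumptions for illustration.

```python
def rule_quality(cases, s, c, p):
    """Accuracy, coverage and support of 'when S, if C then P'."""
    in_s = [r for r in cases if s(r)]       # cases in the segment S
    in_c = [r for r in in_s if c(r)]        # ... that satisfy condition C
    in_p = [r for r in in_s if p(r)]        # ... that satisfy prediction P
    in_cp = [r for r in in_c if p(r)]       # ... that satisfy both C and P
    return {
        "accuracy": len(in_cp) / len(in_c),
        "coverage": len(in_cp) / len(in_p),
        "support": len(in_c) / len(in_s),
    }

# Hypothetical records (income in thousands)
cases = [
    {"region": "ne", "inc": 50, "child": 3, "sales": 120},
    {"region": "ne", "inc": 45, "child": 3, "sales": 80},
    {"region": "ne", "inc": 30, "child": 1, "sales": 150},
    {"region": "ne", "inc": 30, "child": 0, "sales": 50},
    {"region": "sw", "inc": 60, "child": 4, "sales": 200},
]
quality = rule_quality(
    cases,
    s=lambda r: r["region"] == "ne",
    c=lambda r: r["inc"] > 41 and r["child"] > 2,
    p=lambda r: r["sales"] > 100,
)
print(quality)
```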
Understandability
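Contingency table of C against P, with cell counts $n_{ij}$ ($n_{11}$, $n_{12}$, $n_{21}$, $n_{22}$), row totals $r_i$, column totals $c_j$, and $n$ cases in all
$e_{ij} = \frac{r_i c_j}{n}$
$\chi^2 = \sum_{i,j} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}$
Cramér's $V = \sqrt{\chi^2 / n}$ for a 2x2 table
higher values imply C and P are related
Correlation
linear correlation: product-moment corr. coefficient
monotonically correlated: Spearman's rank corr. coeff.
Correlation coefficient x support
Interesting rule: compare $|S \cap C \cap P| / |S \cap C|$ with $|S \cap P| / |S|$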
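The chi-square statistic and Cramér's V for a contingency table of C against P can be computed as below. This is a sketch using the standard definitions (expected count = row total x column total / n); the function name and the example tables are assumptions.

```python
import math

def chi_square_and_v(table):
    """Chi-square statistic and Cramér's V for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # min(rows - 1, cols - 1) equals 1 for a 2x2 table
    k = min(len(row_totals), len(col_totals)) - 1
    return chi2, math.sqrt(chi2 / (n * k))

print(chi_square_and_v([[10, 0], [0, 10]]))   # C and P perfectly related
print(chi_square_and_v([[5, 5], [5, 5]]))     # C and P unrelated
```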
DM application
Symbolic models of consumer choice
{(35 ≤ inc ≤ 40K) (age < 43) or (inc > 63K) (age > 55)} then Buy
assumption-free
behavioral insights for targeting promotions
advantage over decision tree algorithms?
DTs are stepwise optimal, but not globally so
high noise-sensitivity of DTs
Performance evaluation
Accuracy/Error rate
will higher accuracy give better performance for the target task?
"The use of error rate often suggests insufficiently careful thought about the real objectives of the research." David J. Hand, Construction and Assessment of Classification Rules.
            Predicted P   Predicted N
Actual P    True P        False N
Actual N    False P       True N
sensitivity, specificity
misclassification costs
Of course, with a 99:1 split in the data, a default dummy model gives 99% accuracy.
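The 99:1 remark can be made concrete. The sketch below builds a data set with one positive in a hundred cases and scores a dummy model that always predicts the majority class; the function name and the 1/0 encoding of positive/negative are assumptions.

```python
def confusion_metrics(actual, predicted):
    """Confusion-matrix counts and derived rates for 1/0 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "counts": (tp, fn, fp, tn),
    }

actual = [1] * 1 + [0] * 99      # 99:1 split
predicted = [0] * 100            # dummy model: always predict negative
metrics = confusion_metrics(actual, predicted)
print(metrics)
```

Accuracy is high while sensitivity is zero: the dummy model never finds the one case of interest, which is why sensitivity, specificity, and misclassification costs are more informative than the error rate here.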
Model Representation
Non-linear tree-structured models (GP)
Non-linear interaction terms
Function set (internal nodes): {+,-,*,/,log}
Example tree: (x1 * log(x3)) / 5
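A tree-structured model over this function set can be represented as nested tuples and evaluated recursively. The tuple encoding, the names `FUNCTIONS` and `evaluate`, and the input values are assumptions for this sketch; the division and log here are unprotected, whereas practical GP systems often guard against division by zero and non-positive log arguments.

```python
import math

# Function set: the internal nodes of the tree
FUNCTIONS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
    "log": math.log,
}

def evaluate(node, env):
    """Evaluate a tree: tuples are operators, strings are variables."""
    if isinstance(node, tuple):
        op, *args = node
        return FUNCTIONS[op](*(evaluate(arg, env) for arg in args))
    if isinstance(node, str):
        return env[node]       # terminal: predictor variable
    return node                # terminal: numeric constant

# (x1 * log(x3)) / 5
tree = ("/", ("*", "x1", ("log", "x3")), 5)
print(evaluate(tree, {"x1": 10.0, "x3": math.e}))
```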
* 100.
Decile Maximization(DMAX)
Objective
Find a model f(x) (predictor variables x) such that performance in the upper deciles (a specified depth-of-file) is maximized
Explicitly manages the resource constraint: mailings to particular depths-of-file
Performance at different mailing depths: models optimized for different mailing depths
$5 $9 $7 $3 $1 $8
Decile   top   2    3    4    5    6    7    8    9    bottom
Profit   $10   $9   $8   $7   $6   $5   $4   $3   $2   $1
X1       45    35   31   30   6    45   30   23   16   12
X2       5     21   38   30   10   37   10   30   13   30
GA DMAX
Representation: w1 w2 w3 .. wk
Integrated variable selection
Fitness evaluation
classification accuracy
model reliability
maximize specified decile performance: response, profit, etc.
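A decile-performance fitness for a weight-vector representation might look like the sketch below: score every case with the weighted sum, rank the file by score, and total the outcome (response or profit) captured in the top portion. The linear scoring form, the function name `dmax_fitness`, and the toy data are assumptions; the slides do not specify the exact computation.

```python
def dmax_fitness(weights, X, y, depth=0.1):
    """Total outcome captured in the top `depth` fraction of the file."""
    scores = [sum(w * v for w, v in zip(weights, row)) for row in X]
    ranked = sorted(range(len(X)), key=lambda i: scores[i], reverse=True)
    k = max(1, round(len(X) * depth))    # number of cases mailed
    return sum(y[i] for i in ranked[:k])

# Toy data: profit tracks the first predictor; the second is noise
X = [[v, (7 * v) % 5] for v in range(1, 11)]
y = [row[0] for row in X]

print(dmax_fitness([1.0, 0.0], X, y, depth=0.2))    # ranks by x1
print(dmax_fitness([-1.0, 0.0], X, y, depth=0.2))   # ranks in reverse
```

A weight of zero removes a predictor from the score, which is one way a weight-vector representation can fold variable selection into the search.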
Hybrid algorithm