
GADataMining CNA


Genetic Algorithms for Data Mining

Sid Bhattacharyya

Overview
Genetic Algorithms: a gentle introduction
What are GAs?
How do they work, and why?
Critical issues

Using Genetic algorithms (effectively)
Use in Data Mining

Natural Genetics to AI
Computational models inspired by biological evolution
survival of the fittest
reproduction through cross-breeding

Genetic Algorithms
Population based search (parallel)
simultaneous search from multiple points in search space

population members: potential solutions

Fitness function (search objective)


numerical figure of merit/utility measure of an individual

selection

Mating and reproduction of individuals


crossover, mutation

Evolution from one generation to the next


iterative search, convergence

Advantage GAs
General purpose, robust search technique
application to varied problem types

Data mining
fitness function: flexible expression of modeling criteria, tradeoffs amongst multiple objectives
models optimized to specific business objectives
diverse model representation: linear, non-linear interaction terms, rules, sequences, etc.

GA Application Examples
Function optimizers: difficult, discontinuous, multi-modal, noisy functions
Combinatorial optimization: layout of VLSI circuits, factory scheduling, traveling salesman problem
Design and Control: bridge structures, neural networks, communication networks design; control of chemical plants, pipelines
Machine learning: classification rules, economic modeling, scheduling strategies
Portfolio design, optimized trading models, direct marketing models, sequencing of TV advertisements, adaptive agents, data mining, etc.

GAs: Basic Principles


Representation of individuals
String of parameters (genes) : chromosome
eg. F(p,q,r,s,t): p q r s t

Bit-string representation (?):


100110101101100

genotype and phenotype

GAs: Basic Principles


Survival of the fittest (Fitness function)
numerical figure of merit/utility measure of an individual
tradeoff amongst multiple evaluation criteria
efficient evaluation

GAs: Basic Principles


Reproduction to create offspring
Selection Crossover Mutation

GAs: Basic Principles


Convergence
progression towards uniformity in population
premature convergence? (local optima)

GA: Basic Operation

Generation t: Solution1 (f1), Solution2 (f2), Solution3 (f3), Solution4 (f4), ..., SolutionN (fN)
Selection: Solution1, Solution2, Solution2, Solution4, ..., SolutionX
Recombination (Crossover, Mutation): Offspring1(1,4), Offspring2(1,4), Offspring3(2,7), Offspring4(2,7), ..., OffspringN(x,y)
Generation t+1: the offspring

GAs: Parallel Search

[Figure: fitness landscape with GA population points (X) and a hill climber]

Typical GA Run
[Figure: best and average fitness plotted against generations]

Operators: Selection
Fitness proportionate selection (f_i / f_avg): number of reproductive trials for individual i

Selection
Roulette-wheel selection
(stochastic sampling with replacement)

wheel spaced in proportion to fitness values
N (pop size) spins of the wheel
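
A minimal sketch of roulette-wheel selection (stochastic sampling with replacement), assuming non-negative fitness values with a positive sum. The function name and the `rng` parameter are illustrative, not taken from the slides.

```python
import random

def roulette_wheel_select(fitness, n_spins, rng=None):
    """Return the indices chosen by n_spins independent spins of the wheel."""
    rng = rng or random.Random()
    total = sum(fitness)
    selected = []
    for _ in range(n_spins):
        spin = rng.uniform(0, total)      # where the wheel stops
        running = 0.0
        for i, f in enumerate(fitness):
            running += f                  # slot i covers [running - f, running)
            if spin < running:
                selected.append(i)
                break
        else:
            # spin landed exactly on the upper edge (or float round-off)
            selected.append(len(fitness) - 1)
    return selected

picks = roulette_wheel_select([1, 2, 3, 4], n_spins=1000, rng=random.Random(0))
```

Because each spin is independent, the realized number of copies of an individual can differ noticeably from its expected value, which is the motivation for stochastic universal sampling.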

Selection
Stochastic universal sampling
N equally spaced pins on wheel
single turn of the wheel
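
A hedged sketch of stochastic universal sampling: one random offset places N equally spaced pointers on the cumulative-fitness wheel. It assumes non-negative fitness values with a positive sum; the names are illustrative.

```python
import random

def stochastic_universal_sampling(fitness, n, rng=None):
    """Select n indices with a single turn of a wheel carrying n equally spaced pins."""
    rng = rng or random.Random()
    step = sum(fitness) / n               # distance between adjacent pins
    start = rng.uniform(0, step)          # the single random turn
    selected = []
    i = 0
    running = fitness[0]                  # upper edge of slot i
    for k in range(n):
        pointer = start + k * step
        while pointer >= running and i < len(fitness) - 1:
            i += 1
            running += fitness[i]
        selected.append(i)
    return selected

picks = stochastic_universal_sampling([1, 1, 2, 4], n=8, rng=random.Random(3))
```

With fitness values 1, 1, 2, 4 and eight pins, each individual receives exactly its expected number of copies (1, 1, 2, 4), whatever the random offset.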

Selection
Premature convergence
Fitness scaling
f' = f - (2*avg. - max.)

Ranked fitness
Elitism
Steady-state selection
Demetic grouping

Operators: Crossover
Parent 1:    11010 | 101100101
Parent 2:    xxyxx | yxyyxxyxy
(crossover site marked by |)

Offspring 1: 11010 | yxyyxxyxy
Offspring 2: xxyxx | 101100101

(Single-pt. crossover)

combining good building blocks
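
The single-point example above can be reproduced with a short sketch. The function name and the optional `site` argument are assumptions made for illustration; when `site` is omitted a random interior site is drawn.

```python
import random

def single_point_crossover(parent1, parent2, site=None, rng=None):
    """Swap the tails of two equal-length strings after the crossover site."""
    if len(parent1) != len(parent2):
        raise ValueError("parents must have the same length")
    rng = rng or random.Random()
    if site is None:
        site = rng.randint(1, len(parent1) - 1)   # interior site only
    child1 = parent1[:site] + parent2[site:]
    child2 = parent2[:site] + parent1[site:]
    return child1, child2

# Parents from the slide, crossover site after the fifth gene
o1, o2 = single_point_crossover("11010101100101", "xxyxxyxyyxxyxy", site=5)
```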

Crossover
Parent 1:    axpsqvqbtpihd
Parent 2:    qzxxaycgbtphw
(crossover sites at multiple positions)

Offspring 1: azpsavcbtpphd
Offspring 2: qxxxqyqgbtihw

(Uniform crossover)
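
A minimal sketch of uniform crossover on string chromosomes: each gene position is swapped between the parents independently. The swap probability of 0.5 is a common default and an assumption here, not a value from the slides.

```python
import random

def uniform_crossover(parent1, parent2, swap_prob=0.5, rng=None):
    """Exchange each gene between the two parents with probability swap_prob."""
    rng = rng or random.Random()
    child1, child2 = [], []
    for g1, g2 in zip(parent1, parent2):
        if rng.random() < swap_prob:      # this position is a crossover site
            g1, g2 = g2, g1
        child1.append(g1)
        child2.append(g2)
    return "".join(child1), "".join(child2)

p1, p2 = "axpsqvqbtpihd", "qzxxaycgbtphw"
c1, c2 = uniform_crossover(p1, p2, rng=random.Random(0))
```

At every position the two offspring together still carry exactly the two parental genes; only their assignment to offspring changes.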

Crossover
[Figure: fitness landscape showing parents and their offspring]

Operators: Mutation
alters each gene with small probability
Before: x1yx0y0yy0x yxy
After:  x1yx0y1yy0x xxy
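
A hedged sketch of mutation on a plain bit-string (the slide's example mixes symbols; a binary alphabet is assumed here for simplicity). The default rate of 0.01 is illustrative only.

```python
import random

def bit_flip_mutation(bits, rate=0.01, rng=None):
    """Flip each bit of a '0'/'1' string independently with probability rate."""
    rng = rng or random.Random()
    flipped = {"0": "1", "1": "0"}
    return "".join(flipped[b] if rng.random() < rate else b for b in bits)

mutant = bit_flip_mutation("100110101101100", rate=0.05, rng=random.Random(7))
```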

Recombination operators
Mutation & premature convergence
Mutation vs. Crossover: operator probabilities, which is more important?

Optimal parameter settings (!)

Non-Binary Representations
Integer, real-number, order-based, rules, ...
Binary or Real-valued?
real representations give faster, more consistent, more accurate results

High-level representation
intuitive, can utilize specialized crossover and mutation
effective search over complex spaces
design of representation and operators: forma theory

Real-valued representation
Parent1:    3.45  0.56  6.78  0.976  2.5
Parent2:    0.98  1.06  4.20  0.34   1.8

Offspring1: 3.22  0.56  6.78  0.65   2.12
Offspring2: 1.43  1.06  4.20  0.41   1.93
(Arithmetic crossover)
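
A minimal sketch of one common form of arithmetic crossover, in which every gene of the offspring is a weighted average of the parents' genes. The slide's example appears to blend only some genes, so this whole-chromosome variant and the mixing weight `alpha` are assumptions for illustration.

```python
import random

def arithmetic_crossover(parent1, parent2, alpha=None, rng=None):
    """Blend two real-valued chromosomes gene by gene with weight alpha."""
    rng = rng or random.Random()
    if alpha is None:
        alpha = rng.random()              # random mixing weight in [0, 1)
    child1 = [alpha * a + (1 - alpha) * b for a, b in zip(parent1, parent2)]
    child2 = [(1 - alpha) * a + alpha * b for a, b in zip(parent1, parent2)]
    return child1, child2

p1 = [3.45, 0.56, 6.78, 0.976, 2.5]
p2 = [0.98, 1.06, 4.20, 0.34, 1.8]
c1, c2 = arithmetic_crossover(p1, p2, alpha=0.5)
```

Each offspring gene lies between the two parental values, so the operator never leaves the range spanned by the parents.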

High-level representation
Parent1: {(1.2 ≤ x1 ≤ 3.4) (5.8 ≤ x2 ≤ 6.0) (0.2 ≤ x7 ≤ 0.61)}
Parent2: {(2.3 ≤ x6 ≤ 4.1) (3.6 ≤ x2 ≤ 5.1) (5.1 ≤ x4 ≤ 5.61)
Offspring1: {(1.2 ≤ x1 ≤ 3.4) (2.2 ≤ x9 ≤ 2.7) (5.1 ≤ x4 ≤ 5.61)}
Offspring2: {(2.3 ≤ x6 ≤ 4.1) [(3.6 ≤ x2 ≤ 5.1) (5.8 ≤ x2 ≤ 6.0)]
(0.3 ≤ x3 ≤ 1.1) (0.2 ≤ x7 ≤ 0.61)}

(0.3 ≤ x3 ≤ 1.1) (2.2 ≤ x9 ≤ 2.7)}

High-level representation
Generalize/Specialize
{(0.3 ≤ x3 ≤ 1.1) (2.2 ≤ x9 ≤ 2.7)}
{(0.3 ≤ x3 ≤ 1.1) (2.2 ≤ x9 ≤ 2.7) (5.1 ≤ x4 ≤ 6.2)}

{(0.3 ≤ x3 ≤ 1.1) (2.2 ≤ x9 ≤ 2.7)}
{(0.45 ≤ x3 ≤ 0.9) (1.9 ≤ x9 ≤ 2.9)}

Tree-structured representation (GP)


[Figure: two expression trees with function nodes (if, *, /, log, AND, <, >) and leaf nodes (x, y, constants)]

(x * log(y)) / 5
If (y<7) and (x>2) then 0, else 2x+y

Genetic search: Issues


Coding scheme, fitness function critical
General mechanism so robust that, within reasonable margins, parameter settings are not critical.
exploiting problem-specific knowledge: the art in GA design!

Genetic search: Issues


Stochastic search
multiple runs with different random streams

Exploration vs. exploitation of search
Does not guarantee optimality! But ...
Structured population models
Parallelizable for large data

GAs and Optimization


Search space: representation
Global search without gradient information
functions with multiple local optima
non-differentiable functions

Robust, assumption-free, and very general
Hybrid approaches: GAs with conventional optimization techniques

Using GAs ?
When to use a GA?
GA and traditional techniques
How long does it take?
Will it perform better?

Using GAs
population size
mutation, crossover rates
how many generations
multiple runs

Is it a black-box?

[Figure: a box marked "?" with the labels Data characteristics, Fitness function, GA parameters, and "Huh?"]

GA Application Examples
Function optimizers
difficult, discontinuous, multimodal, noisy functions

Combinatorial optimization
layout of VLSI circuits, factory scheduling

Design and Control


bridge structures, neural networks, communication networks design; control of chemical plants, pipelines

Machine learning
classification rules, economic modeling, scheduling strategies

Portfolio design, optimized trading models, direct marketing models, sequencing of TV advertisements, adaptive agents, data mining, etc.

GAs and Data Mining


Discovery
Prediction
Hypothesis testing and refinement

Data Mining
Pattern templates
([attribute in {v1,v2}] and [attribute=value]) or ([attribute in {v1,v2,v3}] and [attribute>value]) or

when S, if C then P
when region=ne, if inc > 41K and child>2 (C) then x-sales>100 (P)

when S, C and P are positively correlated
the mean of A when S and C is significantly different from the mean of A when S

Data mining
How good are the patterns?
accuracy = (# cases in C and P) / (# cases in C)
coverage = (# cases in C and P) / (# cases in P)
support  = (# cases in C) / (# cases in S)

Understandability
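
The three pattern-quality ratios can be sketched as follows for a rule of the form "when S, if C then P". The record layout and the five sample records are invented purely for illustration; only the rule itself (region=ne, inc > 41K and child > 2, x-sales > 100) comes from the slide.

```python
def pattern_quality(records, in_s, cond_c, pred_p):
    """Compute accuracy, coverage and support of 'when S, if C then P'."""
    s_cases = [r for r in records if in_s(r)]          # cases in S
    c_cases = [r for r in s_cases if cond_c(r)]        # cases in C (within S)
    p_cases = [r for r in s_cases if pred_p(r)]        # cases in P (within S)
    cp_cases = [r for r in c_cases if pred_p(r)]       # cases in C and P
    return {
        "accuracy": len(cp_cases) / len(c_cases) if c_cases else 0.0,
        "coverage": len(cp_cases) / len(p_cases) if p_cases else 0.0,
        "support": len(c_cases) / len(s_cases) if s_cases else 0.0,
    }

# Hypothetical records: (region, income in $K, children, x-sales)
records = [
    ("ne", 50, 3, 120),
    ("ne", 45, 3, 80),
    ("ne", 30, 1, 150),
    ("ne", 30, 0, 50),
    ("sw", 60, 4, 200),
]
quality = pattern_quality(
    records,
    in_s=lambda r: r[0] == "ne",
    cond_c=lambda r: r[1] > 41 and r[2] > 2,
    pred_p=lambda r: r[3] > 100,
)
```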

GA for Data Mining


Fitness evaluation
Expected values, Chi-square

Contingency table (cases in S):

            P      not P
C           n11    n12     r1 = n11 + n12
not C       n21    n22     r2 = n21 + n22
            c1     c2      n

$e_{ij} = \frac{r_i c_j}{n}$

$\chi^2 = \sum_{i,j} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}$

Cramer's V: $V = \sqrt{\chi^2 / n}$ for this 2x2 table

higher values imply C and P are related
Correlation
linear correlation: product moment corr. coefficient
monotonically correlated: Spearman's rank corr. coeff.
Correlation coefficient x support
Interesting rule: compare |S ∩ C ∩ P| / |S ∩ C| with |S ∩ P| / |S|
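
A hedged sketch of a chi-square based fitness for a rule, computed from the four cell counts of the C-by-P contingency table. It assumes every row and column total is positive, and uses the standard 2x2 form of Cramer's V, sqrt(chi-square / n). The function name is illustrative.

```python
import math

def chi_square_2x2(n11, n12, n21, n22):
    """Return (chi-square, Cramer's V) for a 2x2 table of observed counts."""
    observed = [[n11, n12], [n21, n22]]
    rows = [n11 + n12, n21 + n22]         # r1, r2
    cols = [n11 + n21, n12 + n22]         # c1, c2
    n = sum(rows)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n          # e_ij = r_i * c_j / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2, math.sqrt(chi2 / n)

independent = chi_square_2x2(10, 10, 10, 10)
perfect = chi_square_2x2(10, 0, 0, 10)
```

When C and P are independent both measures are 0; when C determines P exactly, V reaches 1.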

DM application
Symbolic models of consumer choice
{(35 ≤ inc ≤ 40K) (age < 43) or (inc > 63K) (age > 55)} then Buy

assumption-free
behavioral insights for targeting promotions
advantage over decision tree algorithms?
DTs are stepwise optimal, but not globally so
high noise-sensitivity of DTs

advantages over neural networks

Performance evaluation
Accuracy/Error rate
will higher accuracy give better performance for the target task?
"The use of error rate often suggests insufficiently careful thought about the real objectives of the research." David J. Hand, Construction and Assessment of Classification Rules.

              Predicted P   Predicted N
Actual P      True P        False N
Actual N      False P       True N

sensitivity, specificity
misclassification costs
Of course, with a 99:1 split in the data, a default dummy model gives 99% accuracy.
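
The 99:1 remark can be checked with a small sketch that derives accuracy, sensitivity, and specificity from the four confusion-matrix counts. The function name is an assumption; the dummy model below always predicts the majority class N.

```python
def classification_rates(tp, fn, fp, tn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,   # true-positive rate
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,   # true-negative rate
    }

# 1 actual positive, 99 actual negatives, everything predicted N
dummy = classification_rates(tp=0, fn=1, fp=0, tn=99)
```

The dummy model scores 99% accuracy and perfect specificity while its sensitivity is zero: it never finds a single responder.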

Model Representation
Non-linear tree-structured models (GP)
Non-linear interaction terms
Function set (internal nodes): {+,-,*,/,log}
Terminal set (leaf nodes): {constants, variables}

Example tree: / at the root with children * and 5; * has children x1 and log(x3)

(x1 * log(x3)) / 5
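
A minimal sketch of how such a tree might be stored and evaluated, using nested tuples for internal nodes, strings for variables, and numbers for constants. The tuple encoding and the use of the natural logarithm for `log` are assumptions; the slides do not specify either.

```python
import math
import operator

# Function set: internal nodes
FUNCTIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
    "log": math.log,          # natural log assumed
}

def evaluate(node, variables):
    """Recursively evaluate an expression tree for the given variable values."""
    if isinstance(node, tuple):               # internal node: (function, children...)
        op, *children = node
        return FUNCTIONS[op](*(evaluate(c, variables) for c in children))
    if isinstance(node, str):                 # leaf: variable
        return variables[node]
    return node                               # leaf: constant

# (x1 * log(x3)) / 5
tree = ("/", ("*", "x1", ("log", "x3")), 5)
value = evaluate(tree, {"x1": 10.0, "x3": math.e})
```

Subtree crossover in GP then amounts to swapping nested tuples between two such trees.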

DM Performance: Decile Analysis


Decile   Number of   Number of   Response   Cumulative   Cumulative      Cumulative
         Customers   Responses   Rate (%)   Responses    Response        Response
                                                         Rate (%)        Lift
top        2500        2179        87.2       2179         87.2            447
2          2500        1753        70.1       3932         78.6            403
3          2500         396        15.8       4328         57.7            296
4          2500         111         4.4       4439         44.4            228
5          2500         110         4.4       4549         36.4            187
6          2500          85         3.4       4634         30.9            158
7          2500          67         2.7       4701         26.9            138
8          2500          69         2.8       4770         23.9            122
9          2500          49         2.0       4819         21.4            110
bottom     2500          55         2.2       4874         19.5            100
Total    25,000        4874        19.5

Cumulative Lift (decile) = (cum. avg. performance at decile / overall avg. performance) * 100
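
As a worked check of the cumulative lift formula, the sketch below recomputes the lift column of the decile analysis from the per-decile response counts given in the table. The function name is illustrative.

```python
def cumulative_lift(responses, customers):
    """Cumulative lift per decile: 100 * cumulative rate / overall rate."""
    overall_rate = sum(responses) / sum(customers)
    lifts = []
    cum_responses = 0
    cum_customers = 0
    for r, c in zip(responses, customers):
        cum_responses += r
        cum_customers += c
        cum_rate = cum_responses / cum_customers
        lifts.append(round(100 * cum_rate / overall_rate))
    return lifts

# Per-decile responses from the decile analysis table (2500 customers each)
responses = [2179, 1753, 396, 111, 110, 85, 67, 69, 49, 55]
lifts = cumulative_lift(responses, [2500] * 10)
```

For the top decile this is 100 * (2179 / 2500) / (4874 / 25000), which rounds to 447 as in the table.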

Decile Maximization(DMAX)
Objective
Find model f(x) (predictor variables x) such that performance in upper deciles (specified depth-of-file) is maximized

Explicitly manages resource constraint: mailings to particular depths-of-file
Performance at different mailing depths: models optimized for different mailing depths

[Table: deciles top through bottom against Number of Responders / Profit, with "max" marked in three deciles]

DMAX: Illustrative Example


[Figure: scatter plot of ten customers labeled by profit ($1 to $10), comparing DMAX 40% ($32) with OLS ($28)]

Profit   X1   X2
$10      45    5
$9       35   21
$8       31   38
$7       30   30
$6        6   10
$5       45   37
$4       30   10
$3       23   30
$2       16   13
$1       12   30

OLS: .14 X1 + .06 X2
DMAX 40%: .19 X1 + .07 X2

GA DMAX
Representation: w1 w2 w3 .. wk
Integrated variable selection
Fitness evaluation
classification accuracy
model reliability
maximize specified decile performance: response, profit, etc.

Hybrid algorithm

Comparative Performance: Case I


Response modeling
maximize response in top 3 deciles
4.6% response to mailing
DMAX (30%): - 0.01X1 - 2.51X2 - 0.008X3 - 0.08X4
LOGIT: - 0.40 - 0.01X2 - 0.007X3 - 3.25X4
Neural Network: 3 layers, 2 hidden nodes, 12 coefficients

Case I: Genetic Algorithm DMAX (30%)


Decile   Number of   Number of   Decile     Cum        Cum
         Customers   Responses   Response   Response   Response
                                 Rate       Rate       Lift
top        4,617        865       18.7%      18.7%      411
2          4,617        382        8.3%      13.5%      296
3          4,617        290        6.3%      11.1%      244
4          4,617        128        2.8%       9.0%      198
5          4,617         97        2.1%       7.6%      167
6          4,617         81        1.8%       6.7%      146
7          4,617         79        1.7%       5.9%      130
8          4,617         72        1.6%       5.4%      118
9          4,617         67        1.5%       5.0%      109
bottom     4,617         43        0.9%       4.6%      100
TOTAL     46,170      2,104        4.6%

Case I: Cum Response Lift Comparison


Decile   Genetic Algorithm   Logistic     Neural
         DMAX(30%)           Regression   Network
top        411                 384          385
2          296                 284          277
3          244                 227          221
4          198                 194          186
5          167                 166          164
6          146                 146          146
7          130                 131          131
8          118                 119          118
9          109                 108          108
bottom     100                 100          100

Case II 2% Response Rate

Cum Response Lift Comparison


Decile   Genetic     Genetic     Genetic     Genetic     Logistic
         Algorithm   Algorithm   Algorithm   Algorithm   Regression
         DMAX(10%)   DMAX(20%)   DMAX(30%)   DMAX(40%)
1          220         186         191         192         194
2          174         195         166         166         165
3          157         173         179         150         148
4          148         158         158         161         154*
5          139         145         146         146         146
6          131         135         138         138         138
7          122         124         127         127         127
8          114         116         117         117         117
9          108         108         109         109         109
bottom     100         100         100         100         100

Case II: 2% Response Rate Smoothness: Logistic Regression


Decile   Number of   Number of   Decile     Cum        Cum
         Customers   Responses   Response   Response   Response
                                 Rate       Rate       Lift
top        7,203        283        3.9%       3.9%      194
2          7,220        200        2.8%       3.3%      165
3          7,225        165        2.3%       3.0%      148
4          7,215        255*       3.5%       3.1%      154*
5          7,227        167        2.3%       3.0%      146
6          7,220        140        1.9%       2.8%      138
7          7,209         89        1.2%       2.6%      127
8          7,228         68        0.9%       2.4%      117
9          7,205         65        0.9%       2.2%      109
bottom     7,232         32        0.4%       2.0%      100
TOTAL     72,184      1,464        2.0%

Case II: 2% Response Rate Smoothness: GA DMAX (10%)


Decile   Number of   Number of   Decile     Cum        Cum
         Customers   Responses   Response   Response   Response
                                 Rate       Rate       Lift
top        7,203        322        4.5%       4.5%      220
2          7,220        188        2.6%       3.5%      174
3          7,225        178        2.5%       3.2%      157
4          7,215        178        2.5%       3.0%      148
5          7,227        151        2.1%       2.8%      139
6          7,220        133        1.8%       2.7%      131
7          7,209        103        1.4%       2.5%      122
8          7,228         84        1.2%       2.3%      114
9          7,205         81        1.1%       2.2%      108
bottom     7,232         46        0.6%       2.0%      100
TOTAL     72,184      1,464        2.0%

Case II: 2% Response Rate Smoothness: GA DMAX (20%)


Decile   Number of   Number of   Decile     Cum        Cum
         Customers   Responses   Response   Response   Response
                                 Rate       Rate       Lift
top        7,203        271        3.8%       3.8%      186
2          7,220        299*       4.1%       4.0%      195*
3          7,225        191        2.6%       3.5%      173
4          7,215        162        2.2%       3.2%      158
5          7,227        140        1.9%       2.9%      145
6          7,220        119        1.8%       2.7%      135
7          7,209         90        1.2%       2.5%      124
8          7,228         85        1.2%       2.3%      116
9          7,205         69        1.0%       2.2%      108
bottom     7,232         38        0.5%       2.0%      100
TOTAL     72,184      1,464        2.0%

Comparative Performance: Case III


Profit modeling
maximize profit in top 2 deciles
mailing (profit / size)
Non-responder: -$0.29 / 92.55%
Unpaid responder: -$5.65 / 7.10%
Paid responder: +$275 / 0.35%
Average profit for mailing: +$0.32
DMAX (20%): - .36X1 - .23X2 + .005X3 + .24X4
LOGIT(PR): - .01X1 - .03X2 + .322X3 + .25X4

Case IV: Profit Model Genetic Algorithm DMAX (20%)


Decile   Number of   Percent      Percent      Decile    Cum       Cum
         Customers   PAID         UNPAID       Average   Average   Profit
                     Responders   Responders   Profit    Profit    Lift
top        8,171       0.82%        10.1%       $1.43     $1.43     444
2          8,171       0.62%         8.7%       $0.96     $1.20     371
3          8,171       0.37%         8.2%       $0.28     $0.89     277
4          8,171       0.34%         8.4%       $0.20     $0.72     223
5          8,171       0.29%         5.9%       $0.20     $0.62     191
6          8,171       0.32%         7.4%       $0.19     $0.54     169
7          8,171       0.23%         4.0%       $0.13     $0.49     151
8          8,171       0.18%         4.8%      -$0.04     $0.42     130
9          8,171       0.24%         8.3%      -$0.06     $0.37     114
bottom     8,171       0.17%         4.9%      -$0.08     $0.32     100
TOTAL     81,710       0.35%         7.1%

Case IV: Profit Model

Cum Profit Lift Comparison


Decile   Genetic Algorithm   Logistic
         DMAX (20%)          Regression
top        444                 385
2          371                 294
3          277                 235
4          223                 190
5          191                 184
6          169                 163
7          151                 146
8          130                 123
9          114                 111
bottom     100                 100

Modeling on Multiple Objectives


Model [y1,..,yk] = f (x)
simultaneously optimize on multiple objectives

Some common DM modeling desirables


response and high purchase revenues
likely churners with high usage of services
high tenure and usage
purchase and non-return
cross-selling, etc.

[or CPR (Combined Profit and Response) Models]

Multiple objectives
Traditional approaches
multiple single-objective models, and combine
weighted average of objectives

conflicting objectives
different levels of tradeoffs
frontier of non-dominated solutions
choice of final model based on diverse decision-maker objectives, can also be subjective

Pareto Frontier
Non-dominated solutions
multiple objectives $\pi_i$: $f_a(x)$ is better than $f_b(x)$ if

$\forall i:\ \pi_i(f_a(x)) \ge \pi_i(f_b(x)) \quad \text{and} \quad \exists j:\ \pi_j(f_a(x)) > \pi_j(f_b(x))$

[Figure: non-dominated models and dominated models]

Single GA run obtains tradeoff frontier of non-dominated solutions $f_k(x)$
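
A minimal sketch of the dominance test and of filtering a set of models down to its non-dominated frontier, assuming every objective is to be maximized. The function names are illustrative; the sample points are the Decile 1 (Churn-Lift, $-Lift) pairs reported in the Performance Summary of the experimental study.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def non_dominated(points):
    """Keep the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

# (Churn-Lift, $-Lift) at Decile 1
models = {
    "GA-best": (304.9, 261.7),
    "GP-best": (343.7, 256.5),
    "Logistic Regression": (447.1, 111.8),
    "OLS Regression": (116.2, 360.5),
    "OLS * Logistic": (79, 357),
}
frontier = non_dominated(list(models.values()))
```

On these five points only the OLS * Logistic combination is dominated (by OLS Regression); the other four represent different tradeoffs between the two lifts.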

Multi-objective GA
Pareto-Based Selection (Louis and Rawlins, 93)
randomly select a pair of solutions from population
generate two new offspring
determine the Pareto-optimal set from parents and offspring, and choose two solutions for new population

Elitism
retain best solution intact in next population: fosters local search around best solution
retain non-dominated set of solutions intact in next generation

Fitness evaluation
DMAX approach: fitness at specified depth-of-file d

Experimental Study: Data


Cellular-phone provider seeking to identify potential high-value churners
two dependent variables
binary Churn variable continuous variable measuring revenue ($)

predictors: minutes-of-use (peak and off-peak), average charges, and payment information, etc.
obtained after EDA, normalized to 0 mean, 1 s.d.

50,000 sample: 25,000 for training, 25,000 for test set

Multiple Objectives: Performance


Churn-Lift = (C_d / C) / (N_d / N)
model capturing more churners in top deciles is better

$-Lift = (R_d / R) / (N_d / N)
model giving high revenue customers in upper deciles is better

overall modeling objective
maximize expected revenue saved through identification of high-value churners
Churn-Lift * $-Lift

Experimental Study

Non-dominated models: Decile 1 (Training)


[Figure: Decile 1 (training): $-Lift against Churn-Lift for GP, GA, Logistic, and OLS models]

5 independent GA runs, aggregate the sets of non-dominated solutions

Experimental Study

Non-dominated models: Decile 1 (Test)


[Figure: Decile 1 (test): $-Lift against Churn-Lift for GP, GA, Logistic, and OLS models]

Experimental Study

Non-dominated models: Decile 2 (Test)


[Figure: Decile 2 (test): $-Lift against Churn-Lift for GP, GA, Logistic, and OLS models]

Experimental Study

Non-dominated models: Decile 3 (Test)


[Figure: Decile 3 (test): $-Lift against Churn-Lift for GP, GA, Logistic, and OLS models]

Experimental Study

Non-dominated models: Decile 7 (Test)


[Figure: Decile 7 (test): $-Lift against Churn-Lift for GP, GA, Logistic, and OLS models]

Experimental Study

Performance Summary
Model                 Performance          Decile 1       Decile 2       Decile 3       Decile 7
GA-best               Churn-Lift, $-Lift   304.9, 261.7   265.4, 207.4   272.3, 155.0   138.8, 126.9
                      Product of Lifts     797.8          550.4          422.2          176.1
GP-best               Churn-Lift, $-Lift   343.7, 256.5   343.5, 182.1   275.1, 178.3   139.4, 131.2
                      Product of Lifts     881.5          625.5          490.4          182.9
Logistic Regression   Churn-Lift, $-Lift   447.1, 111.8   403.4, 72.6    295.9, 57.4    137.8, 66.7
                      Product of Lifts     499.8          292.7          169.96         91.9
OLS Regression        Churn-Lift, $-Lift   116.2, 360.5   108.1, 271.7   99.7, 223.2    91.8, 136.2
                      Product of Lifts     418.8          293.71         222.5          125.1
OLS * Logistic        Churn-Lift, $-Lift   79, 357        76, 263        74, 217        78, 136
                      Product of Lifts     282            201            160            106

General Optimization of Lifts


Fitness function
Seeks a general maximization of lifts at all deciles

Specific vs. General Lift Opt


Model                 Performance          Decile 1       Decile 2       Decile 3       Decile 7
GA-best Lift-Opt      Churn-Lift, $-Lift   304.9, 261.7   265.4, 207.4   272.3, 155.0   138.8, 126.9
                      Product of Lifts     797.8          550.4          422.2          176.1
GA-best General-Opt   Churn-Lift, $-Lift   303.2, 261     288.3, 188.8   276.7, 151.3   138.1, 104.5
                      Product of Lifts     791.4          544.3          418.6          144.3
GP-best Lift-Opt      Churn-Lift, $-Lift   343.7, 256.5   343.5, 182.1   275.1, 178.3   139.4, 131.2
                      Product of Lifts     881.5          625.5          490.4          182.9
GP-best General-Opt   Churn-Lift, $-Lift   332, 252.5     265, 223.1     233.9, 186.5   132.3, 133.1
                      Product of Lifts     838.3          591.2          436.2          176.1

Table: Best Prod-Lifts in Deciles

Specific vs. General Lift Opt.


                       Decile 1             Decile 2             Decile 3             Decile 7
Performance            $-Lift  Churn-Lift   $-Lift  Churn-Lift   $-Lift  Churn-Lift   $-Lift  Churn-Lift
GA-best Lift-Opt       361.4   464.7        271.6   401.3        223.9   309.8        136.6   139.5
GA-best General-Opt    361.7   421          273.3   398.1        223.9   304.1        136.6   138.4

GA-best Lift-Opt       372.7   475.2        276.5   417.9        226.1   310.3        137.2   139.8
GA-best General Opt    372.1   421.3        276.8   378.3        226.6   296.7        137.1   139.8

Table: Best $-Lift and Churn-Lifts in Deciles

Case Study EC challenge

EDA, Variable-selection
Problem
15,178 obs., 79 variables, response dependent
Seeking maximum lift in the top decile
Logistic regression model
15 variables, after EDA, transformation (many of them combinations of multiple vars.)

This is the hard part!

Lift of 126 in the top decile

EC approach
Include all variables
Explore simple terms: non-linear GP models
small populations, looking for robust terms

Final model(s) using obtained terms

Case Study EC
Various 2-5 var. terms show some predictability
Lifts ranging in 122-127

Models on these terms


Non-linear, Linear model: lifts in 126-132

Examples
3 tan(HC211) + EC31
(OCC81 - log10(ORDTERM1/IC191))*STATE2*HHAS21
STATE2 * HHAS21
(OCC81 - log10(B)) * B * (A + B + (ORDTERM1 * (A + B))), where A = (STATE2 - SECGENDE) and B = STATE2*HHAS21
B + tan(2B + HHAS21) + EC31 + (ORDTERM1)*(B + tan[B + HHAS21 + ((HHAS21*HV31)/2.1)])
AB^3 (1 + OCC81) + AB(OCC81) + 2DEB(OCC81)^2
4A + B + C + 2D + E + 2*OCC81 (10 vars. total)

Lifts, in slide order:
Trg: 122.5  Test: 122.5
Trg: 124.9  Test: 126.4
Trg: 121.3  Test: 121.3
Trg: 131.5  Test: 126.9
Trg: 131.1  Test: 127.8
Trg: 134.4  Test: 131.6
Trg: 132.5  Test: 131.7
