Data Mining Tutorial: D. A. Dickey
Trees
A divisive method (splits): start with a root node (all observations in one group) and get splitting rules. The response is often binary; the result is a tree.
Examples: loan defaults, the Framingham Heart Study, automobile fatalities.
Recursive Splitting
[Figure: recursive splitting of the loan-default data on X1 = Debt-to-Income Ratio and X2 = Age; default and no-default points are plotted, and the resulting rectangles have estimated Pr{default} ranging from 0.0001 up to 0.012 (e.g. 0.003, 0.006, 0.007).]
Example of a tree
[Tree for the Framingham Heart Study data: the root node holds all 1615 patients; split #1 is on Age, and a later split uses Systolic BP; the branches end in terminal nodes.]
Options: (1) assessment measure: Avg. Sq. Error; (2) N = 4; (3) Gini splits.
Goal: pure leaves (terminal nodes). Ideal split: everyone with BP > x has problems; nobody with BP < x has problems.
Where to Split?
First, a review of chi-square tests for contingency tables.
DEPENDENT example:
             Heart Disease
             No    Yes    Total
Low BP       95      5      100
High BP      55     45      100

INDEPENDENT example:
             Heart Disease
             No    Yes    Total
Low BP       75     25      100
High BP      75     25      100
Chi-Square (χ²) Test Statistic
If independent, expect 100(150/200) = 75 in the upper left cell (and similarly for the other cells, e.g. 100(50/200) = 25).
χ² = Σ over all cells of (observed - expected)² / expected

             Heart Disease
             No          Yes         Total
Low BP       95 (75)      5 (25)       100
High BP      55 (75)     45 (25)       100
Total       150          50            200

(expected counts under independence in parentheses)
Here χ² = (95-75)²/75 + (5-25)²/25 + (55-75)²/75 + (45-25)²/25 = 5.33 + 16 + 5.33 + 16 = 42.67, far beyond the 3.84 cutoff for a 1-degree-of-freedom chi-square at the 5% level, so BP and heart disease are associated.
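The same test can be run in SAS from the cell counts. A minimal sketch (the data set and variable names bp_hd, bp, hd, count are made up for illustration):

data bp_hd;
  input bp $ hd $ count;
datalines;
Low  No  95
Low  Yes  5
High No  55
High Yes 45
;
proc freq data=bp_hd;
  weight count;                /* each row carries a cell count */
  tables bp*hd / chisq;        /* prints the Pearson chi-square (about 42.7) and its p-value */
run;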
Age 47 maximizes logworth
Idea: pick the BP cutoff that minimizes the p-value for χ². But what does significance mean now?
Multiple testing
With 50 different BPs in the data there are 49 ways to split. Sunday football highlights always look good! If he shoots enough times, even a 95% free-throw shooter will miss. Having tried 49 splits, each one has a 5% chance of declaring significance even if there's no relationship.
α = Pr{falsely reject hypothesis 1}
α = Pr{falsely reject hypothesis 2}
Pr{falsely reject one or the other} < 2α
Desired: probability 0.05 or less.
Solution: use α = 0.05/2, or equivalently compare 2 × (p-value) to 0.05.
With 50 different BPs in the data there are m = 49 ways to split. Multiply the p-value by 49: Bonferroni's original idea, applied to data mining (trees) by Kass. Stop splitting if the minimum adjusted p-value is large. For m splits, the logworth becomes -log10(m × p-value).
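The Kass/Bonferroni adjustment is one line of arithmetic. A small sketch with a made-up raw p-value:

data logworth;
  m = 49;                      /* number of candidate splits */
  raw_p = 0.0008;              /* hypothetical smallest raw p-value among the splits */
  adj_p = min(1, m*raw_p);     /* Bonferroni (Kass) adjusted p-value */
  logworth = -log10(adj_p);    /* keep splitting only if this is large */
  put adj_p= logworth=;
run;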
Gini example: the leaf {A A B C B A A B C C} has class proportions 0.4, 0.3, 0.3, so its diversity is 1 - [0.16 + 0.09 + 0.09] = 0.66: MORE DIVERSE, LESS PURE.
Shannon Entropy
Larger means more diverse (less pure): entropy = -Σi pi log2(pi)
{0.5, 0.4, 0.1}: entropy 1.36 (less diverse)
{0.4, 0.3, 0.3}: entropy 1.57 (more diverse)
* (EM uses sampling with replacement)
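Both diversity measures are short computations. A sketch comparing the two leaves above (variable and data set names are arbitrary):

data diversity;
  input p1-p3;                 /* class proportions in a leaf */
  array p{3} p1-p3;
  entropy = 0; gini = 1;
  do i = 1 to 3;
    if p{i} > 0 then entropy = entropy - p{i}*log2(p{i});  /* Shannon entropy */
    gini = gini - p{i}**2;                                  /* Gini diversity: 1 - sum of squared proportions */
  end;
  drop i;
datalines;
0.5 0.4 0.1
0.4 0.3 0.3
;
proc print data=diversity; run;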
Goals
Split if the diversity in the parent node exceeds the summed diversities in the child nodes. Observations should be homogeneous (not diverse) within leaves and different between leaves; the set of leaves should be diverse.
Validation
Traditional statistics: small data sets, need all observations to estimate the parameters of interest.
Data mining: loads of data, can afford a holdout sample.
Variation: n-fold cross validation. Randomly divide the data into n sets; estimate on n-1 of them and validate on the remaining one; repeat n times, using each set once as the holdout.
Pruning
Grow a bushy tree on the fit (training) data, then classify the holdout data. The farthest-out branches likely do not improve, and may even hurt, the fit on the holdout data, so prune the non-helpful branches. What is helpful? What is a good discriminator criterion?
Goals
Want diversity in the parent node > summed diversities in the child nodes.
Goal: reduce diversity within leaves and maximize differences between leaves.
Use validation: average squared error, proportion of correct decisions, etc.
Costs (profits) may enter the picture for splitting or pruning.
Including Probabilities
Leaf has Pr(M) = 0.7, Pr(F) = 0.3, with this profit matrix:

                       True gender
                       M (0.7)    F (0.3)
You say: "Sir"            2         -1
You say: "Ma'am"        -10          5

Expected profit is 2(0.7) - 1(0.3) = +$1.10 if I say "Sir".
Expected profit is -10(0.7) + 5(0.3) = -7 + 1.5 = -$5.50 (a loss) if I say "Ma'am".
Weight the leaf profits by leaf size (# of observations) and sum; prune (and split) to maximize profits.
Additional Ideas
Forests: draw samples with replacement (bootstrap) and grow multiple trees.
Random Forests: also randomly sample the features (predictors) and build multiple trees.
Classify a new point in each tree, then average the probabilities or take a plurality vote across the trees.
Lift
[Lift chart: lift of about 3.3 in the most responsive leaves.]
* Cumulative lift chart: go from the leaf with the most predicted 1-responses to the least. Lift = (proportion responding in the first p%) / (overall population response rate).
Regression Trees
Continuous response Y. The predicted response Pi is constant within each region i = 1, …, 5.
[Figure: the (X1, X2) plane partitioned into five regions with constant predictions 80, 50, 130, 20, and 100.]
Prediction PREDi in cell i; Yij is the jth response in cell i. Split to minimize Σi Σj (Yij - PREDi)².
[The same five predictions (80, 50, 130, 20, 100) shown as the terminal nodes of the corresponding tree.]
Real data example: traffic accidents in Portugal*. Y = injury-induced cost to society.
* Tree developed by Guilhermina Torrao, (used with permission) NCSU Institute for Transportation Research & Education
If the Life Line is long and deep, then this represents a long life full of vitality and health. A short line, if strong and deep, also shows great vitality in your life and the ability to overcome health problems. However, if the line is short and shallow, then your life may have the tendency to be controlled by others
http://www.ofesite.com/spirit/palm/lines/linelife.htm
Result: Predicted Age at Death = 79.24 - 1.367 (lifeline). (Is this real??? Is this repeatable???)
Simulation: Age at Death = 67 + 0 × (life line) + e, where the error e has a normal distribution with mean 0 and variance 200. Simulate 20 samples with n = 50 bodies each. A sketch of this simulation in SAS follows.
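The seed and the range of lifeline lengths below are arbitrary assumptions; only the intercept 67, the zero slope, and the error variance 200 come from the setup above:

data sim;
  call streaminit(12345);
  do rep = 1 to 20;                                      /* 20 samples */
    do i = 1 to 50;                                      /* n = 50 bodies each */
      line = 5 + 10*rand('uniform');                     /* hypothetical lifeline length */
      age  = 67 + 0*line + rand('normal', 0, sqrt(200)); /* true slope 0, error variance 200 */
      output;
    end;
  end;
run;
proc reg data=sim;
  by rep;                                                /* one fitted line per sample, as in the NOTE below */
  model age = line;
run;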
NOTE: Regression equations:
Age(rep:1)  = 80.56253 - 1.345896*line
Age(rep:2)  = 61.76292 + 0.745289*line
Age(rep:3)  = 72.14366 - 0.546996*line
Age(rep:4)  = 95.85143 - 3.087247*line
Age(rep:5)  = 67.21784 - 0.144763*line
Age(rep:6)  = 71.0178  - 0.332015*line
Age(rep:7)  = 54.9211  + 1.541255*line
Age(rep:8)  = 69.98573 - 0.472335*line
Age(rep:9)  = 85.73131 - 1.240894*line
Age(rep:10) = 59.65101 + 0.548992*line
Age(rep:11) = 59.38712 + 0.995162*line
Age(rep:12) = 72.45697 - 0.649575*line
Age(rep:13) = 78.99126 - 0.866334*line
Age(rep:14) = 45.88373 + 2.283475*line
Age(rep:15) = 59.28049 + 0.790884*line
Age(rep:16) = 73.6395  - 0.814287*line
Age(rep:17) = 70.57868 - 0.799404*line
Age(rep:18) = 72.91134 - 0.821219*line
Age(rep:19) = 55.46755 + 1.238873*line
Age(rep:20) = 63.82712 + 0.776548*line
So Predicted Age at Death = 79.24 - 1.367 (lifeline) would NOT be unusual even if there is no true relationship.
Distribution of t Under H0
Conclusion: estimated slopes vary.
The (estimated) standard deviation of the sample slopes is the standard error.
Compute t = (estimate - hypothesized value)/standard error.
The p-value is the probability of a larger |t| when the hypothesis is correct (e.g. slope = 0); it is the sum of the two tail areas.
Traditionally p < 0.05 implies the hypothesized value is wrong; p > 0.05 is inconclusive.
t Value    Pr > |t|
  5.34     <.0001
 -0.86     0.3965

[Figure: the t distribution under H0, with the two tails beyond -0.86 and +0.86 shaded; each tail area is 0.19825, and their sum 0.39650 is the p-value.]

H0: true slope is 0 (no association)
H1: true slope is not 0
p = 0.3965
Simulation: Age at Death = 67 + 0 × (life line) + e, error e normal with mean 0 and variance 200. WHY variance 200? Simulate 20 samples with n = 50 bodies each.
We want an estimate of the variability around the true line. The true variance is σ² (200 in the simulation). Use sums of squared residuals (SS).
The sum of squared residuals from the mean is SS(total) = 9755; the sum of squared residuals around the line is SS(error) = 9609.
(1) SS(total) - SS(error) is SS(model) = 146.
(2) The variance estimate is SS(error)/(error degrees of freedom) = 9609/48 ≈ 200.
(3) SS(model)/SS(total) is R², the proportion of variability explained by the model (here 146.5/9755 = 0.0150).
Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Value   Pr > F
Model     1      146.51753      146.51753      0.73     0.3965
Error    48     9608.70247      200.18130
Total    49     9755.22000

R-Square: 0.0150
The 2nd Martian gives the first piece of information (DF) about the error variance around the mean.

[Figure: scatter plot of Martian Weight versus Martian Height.]

[Figure: a plane fit over inputs X1 and X2, with the vertical gap from a point to the plane labeled "error".]

The fourth leg gives the first chance to measure error (the first error DF). Fitting a plane: 2 model DF, n - 3 = 37 error DF, n - 1 = 39 total DF (DF column: 2, 37, 39).
Extension:
Multiple Regression
Issues:
(1) Testing joint importance versus individual significance: a two-engine plane can still fly if engine #1 fails, and it can still fly if engine #2 fails, so neither engine is critical individually, yet they are jointly critical (can't omit both!).
(2) Prediction versus modeling individual effects.
(3) Collinearity (correlation among inputs).
Example: a hypothetical company's sales Y depend on TV advertising X1 and radio advertising X2:
Y = β0 + β1 X1 + β2 X2 + e
Data Sales;
  length sval $8;
  length cval $8;
  input store TV radio sales;
  (more code)
cards;
 1 869 868  9089
 2 836 820  8290
(more data)
40 969 961 10130
;
Conclusion: Can predict well with just TV, just radio, or both! SAS code: proc reg data=next; model sales = TV radio;
Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Value
Model     2       32660996        16330498    358.84
Error    37        1683844           45509
Total    39       34344840

Root MSE: 213.32908     R-Square: 0.9510 (= 32660996/34344840)

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   Pr > |t|
Intercept    1        531.11390          359.90429      0.1485
TV           1          5.00435            5.01845      0.3251   (can omit TV)
radio        1          4.66752            4.94312      0.3512   (can omit radio)
Estimated Sales = 531 + 5.0 TV + 4.7 radio, with error variance 45509 (standard deviation 213). TV spending is approximately equal to radio spending, so approximately Estimated Sales = 531 + 9.7 TV.
Summary:
Good predictions are given by
  Sales = 531 + 5.0 x TV + 4.7 x Radio, or
  Sales = 479 + 9.7 x TV, or
  Sales = 612 + 9.6 x Radio, or
  (lots of others).
Multicollinearity can be diagnosed by looking at principal components (axes of variation):
the variance along the PC axes = the eigenvalues of the correlation matrix;
the directions the axes point = the eigenvectors of the correlation matrix.

Pearson correlations (p-value beneath each correlation):

            TV         radio      sales
TV        1.00000     0.99737    0.97457
                      <.0001     <.0001
radio     0.99737     1.00000    0.97450
          <.0001                 <.0001
sales     0.97457     0.97450    1.00000
          <.0001      <.0001

[Figure: scatter plot of TV $ versus Radio $; the points fall almost exactly along Principal Component Axis 1.]
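A sketch of how the correlations and principal components above could be produced, assuming the advertising data set from the earlier data step (called Sales there):

proc corr data=Sales;               /* pairwise correlations of TV, radio, sales */
  var TV radio sales;
run;
proc princomp data=Sales out=pc;    /* eigenvalues and eigenvectors of the correlation matrix of the inputs */
  var TV radio;
run;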
TEXT MINING
Hypothetical collection of e-mails (corpus) from analytics students:
John, message 1: "There's a good cook there."
Susan, message 1: "I have an analytics practicum then."
Susan, message 2: "I'll be late from analytics."
John, message 2: "Shall we take the kids to a movie?"
John, message 3: "Later we can eat what I cooked yesterday."
(etc.)
Compute word counts:
         analytics  cook_n  cook_v  kids  late  movie  practicum
John         0         1       1      1     1     1        0
Susan        2         0       0      0     1     0        1
Term counts for 16 students (columns are students 1-16):

student:     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
Job:         5   5   0   8   0  10   1   2   4  26  19   2  16  14   1   3
Practicum:   8   6   2   9   0   6   0   3   1  13  22   0  19  17   0   5
Analytics:  10   9   0   7   4   9   1   1   4   9  10   0  21  12   4   8
Movie:      12   5  14   0  16   5   6  13  16   2  11  14   0   0  21   0
Data:        6   4   0  12   0   5   2   0   2  16   9   1  13  20   3   1
SAS:         0   2   2  14   0  19   1   1   4  20  12   3   9  19   6   2
Kids:        1   0  12   2  15   5   9  12   9   6   0  12   0   0   9   0
Miner:       5   9   0  12   2  20   0  13   0  24  14   0  16  12   3   5
Interview:   8  12   4  15   3  18   0   0   9  30  22  12  12   9   0   4
Late:       18  12  24  22   9  13   2   0   3   9   3  17   0   6   3   6
Cook_v:      5   1  18   0  18   8   6  12   0   7   2  14   0   3   9   1
Cook_n:      0   0   4   0   9   1   0   1   0   2   0   3   0   0   3   0
Documents reordered by cluster (CLUSTER), with the first principal component score (Prin1) and term counts:

document:      1       2       4       6      10      11      13      14      16        3        5        7        8        9       12       15
CLUSTER:       1       1       1       1       1       1       1       1       1        2        2        2        2        2        2        2
Prin1:    0.15311 0.93370 2.08576 1.74995 3.70319 2.76166 3.77000 3.37595 0.44444 -3.62271 -4.18243 -1.90553 -2.54416 -1.41349 -2.98274 -2.32671
Job:           5       5       8      10      26      19      16      14       3        0        0        1        2        4        2        1
Practicum:     8       6       9       6      13      22      19      17       5        2        0        0        3        1        0        0
Analytics:    10       9       7       9       9      10      21      12       8        0        4        1        1        4        0        4
Movie:        12       5       0       5       2      11       0       0       0       14       16        6       13       16       14       21
Data:          6       4      12       5      16       9      13      20       1        0        0        2        0        2        1        3
SAS:           0       2      14      19      20      12       9      19       2        2        0        1        1        4        3        6
Kids:          1       0       2       5       6       0       0       0       0       12       15        9       12        9       12        9
Miner:         5       9      12      20      24      14      16      12       5        0        2        0       13        0        0        3
Grocery_list:  3       0       3       0       4      10       4       5       0       16       17       10       20       12       16        8
Interview:     8      12      15      18      30      22      12       9       4        4        3        0        0        9       12        0
Late:         18      12      22      13       9       3       0       6       6       24        9        2        0        3       17        3
Cook_v:        5       1       0       8       7       2       0       3       1       18       18        6       12        0       14        9
Cook_n:        0       0       0       1       2       0       0       0       0        4        9        0        1        0        3        3
Unsupervised Learning
We have the features (predictors). We do NOT have the response, even on a training data set (UNsupervised). Clustering:
Agglomerative: start with each point separated.
Divisive: start with all points in one cluster, then split.
Direct: state the number of clusters beforehand.
EM: PROC FASTCLUS
Step 1: find (50) seeds as separated as possible. Step 2: cluster points to the nearest seed.
Drift: as points are added, move the seed (centroid) to the average of each coordinate. Alternatively, make a full pass, then recompute the seeds and iterate.
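A minimal PROC FASTCLUS sketch of the two-step idea above (the data set name mydata and inputs x1, x2 are placeholders):

proc fastclus data=mydata maxclusters=50 maxiter=10 out=clusout;
  var x1 x2;   /* Step 1: FASTCLUS picks well-separated seeds; Step 2: points go to the nearest seed and the seeds drift toward the cluster means */
run;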
Clusters as Created
Data tests;
  input IQ Study_Time Grade;
  IQ_S = IQ*Study_Time;
cards;
105 10 75
110 12 79
120  6 68
116 13 85
122 16 91
130  8 79
114 20 98
102 15 76
;
Proc reg data=tests; model Grade = IQ;
Proc reg data=tests; model Grade = IQ Study_Time;
Model Grade = IQ:
Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept     1        62.57113           48.24164        1.30     0.2423
IQ            1         0.16369            0.41877        0.39     0.7094

Model Grade = IQ Study_Time:
Variable     DF   Parameter Estimate   Standard Error
Intercept     1         0.73655           16.26280
IQ            1         0.47308            0.12998
Study_Time    1         2.10344            0.26418
Contrast: TV advertising loses significance when radio is added; IQ gains significance when study time is added.
Model for Grades: Predicted Grade = 0.74 + 0.47 x IQ + 2.10 x Study Time
Question: Does an extra hour of study really deliver 2.10 points for everyone regardless of IQ? Current model only allows this.
Analysis of Variance (interaction model)
Source   DF   F Value   Pr > F
Model     3    26.22    0.0043
Error     4
Total     7

R-Square: 0.9516
(Parameter Estimates table: 4 rows, Intercept, IQ, Study_Time, IQ*Study_Time, each with DF 1.)
Interaction model: Predicted Grade = 72.21 - 0.13 x IQ - 4.11 x Study Time + 0.053 x IQ x Study Time
 = (72.21 - 0.13 x IQ) + (-4.11 + 0.053 x IQ) x Study Time
IQ = 102 predicts Grade = (72.21 - 13.26) + (5.41 - 4.11) x Study Time = 58.95 + 1.30 x Study Time
IQ = 122 predicts Grade = (72.21 - 15.86) + (6.47 - 4.11) x Study Time = 56.35 + 2.36 x Study Time
[Figure: predicted Grade versus Study Time, one line for IQ = 122 (slope = 2.36) and one for IQ = 102 (slope = 1.30).]
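The interaction fit above can be obtained with a third PROC REG call, reusing the IQ_S = IQ*Study_Time variable created in the earlier data step (a sketch; the output itself is not reproduced here):

Proc reg data=tests; model Grade = IQ Study_Time IQ_S;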
Adding the interaction makes everything insignificant (individually)! Do we need to omit insignificant terms until only significant ones remain? Has an acquitted defendant proved his innocence? Common sense trumps statistics!
Classification Variables (dummy variables, indicator variables)
Model for NC crashes involving deer: Proc reg data=deer; model deer = X11;
X11 is 1 in November, 0 elsewhere. (A sketch of how such a dummy can be built from a date variable follows.)
Predicted Accidents = 1181 + 2579 X11
Interpretation: in November, predict 1181 + 2579(1) = 3660; in any other month, predict 1181 + 2579(0) = 1181.
1181 is the average of the other months; 2579 is the added November effect (vs. the average of the others).
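One way the month dummies could be built from a SAS date variable (a sketch; it assumes the deer data set carries a variable named date holding SAS dates, as in the listing further below):

data deer; set deer;
  X10 = (month(date) = 10);   /* 1 in October, 0 otherwise */
  X11 = (month(date) = 11);   /* 1 in November, 0 otherwise */
  X12 = (month(date) = 12);   /* 1 in December, 0 otherwise */
run;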
Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Value   Pr > F
Model     1       30473250       30473250     90.45    <.0001
Error    58       19539666         336891
Total    59       50012916

Root MSE: 580.42294     R-Square: 0.6093 (= 30473250/50012916)

(Parameter estimates: Intercept and X11, DF 1 each, giving the 1181 and 2579 used above.)
Looks like December and October need dummies too! Proc reg data=deer; model deer = X10 X11 X12;
Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Value   Pr > F
Model     3       46152434       15384145    223.16    <.0001
Error    56        3860482          68937
Total    59       50012916

Root MSE: 262.55890     R-Square: 0.9228

Parameter Estimates
Variable    DF   Parameter Estimate
Intercept    1        929.40000
X10          1       1391.20000
X11          1       2830.20000
X12          1       1377.40000

[Data listing: months JAN03 through DEC04 with dummy columns x10, x11, x12 equal to 1 in October, November, and December respectively, 0 in every other month.]
Average of Jan through Sept. is 929 crashes per month. Add 1391 in October, 2830 in November, 1377 in December.
What the heck, let's do all but one (we need an "average of the rest" baseline, so at least one month must be left out): Proc reg data=deer; model deer = X1 X2 … X10 X11;
Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Value   Pr > F
Model    11       48421690        4401972    132.79    <.0001
Error    48        1591226          33151
Total    59       50012916

Root MSE: 182.07290     R-Square: 0.9682

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1        2306.80000         81.42548       28.33    <.0001
X1           1        -885.80000        115.15301       -7.69    <.0001
X2           1       -1181.40000        115.15301      -10.26    <.0001
X3           1       -1220.20000        115.15301      -10.60    <.0001
X4           1       -1486.80000        115.15301      -12.91    <.0001
X5           1       -1526.80000        115.15301      -13.26    <.0001
X6           1       -1433.00000        115.15301      -12.44    <.0001
X7           1       -1559.20000        115.15301      -13.54    <.0001
X8           1       -1646.20000        115.15301      -14.30    <.0001
X9           1       -1457.20000        115.15301      -12.65    <.0001
X10          1          13.80000        115.15301        0.12    0.9051
X11          1        1452.80000        115.15301       12.62    <.0001
Average of rest is just December mean 2307. Subtract 886 in January, add 1452 in November. October (X10) is not significantly different than December.
[Residual plot: runs of positive and negative residuals over time suggest adding a trend term.]
Add date (days since Jan 1, 1960 in SAS) to capture the trend: Proc reg data=deer; model deer = date X1 X2 … X10 X11;
Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Value   Pr > F
Model    12       49220571        4101714    243.30    <.0001
Error    47         792345          16858
Total    59       50012916

Root MSE: 129.83992     R-Square: 0.9842

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1      -1439.94000          547.36656      -2.63     0.0115
X1           1       -811.13686           82.83115      -9.79     <.0001
X2           1      -1113.66253           82.70543     -13.47     <.0001
X3           1      -1158.76265           82.60154     -14.03     <.0001
X4           1      -1432.28832           82.49890     -17.36     <.0001
X5           1      -1478.99057           82.41114     -17.95     <.0001
X6           1      -1392.11624           82.33246     -16.91     <.0001
X7           1      -1525.01849           82.26796     -18.54     <.0001
X8           1      -1618.94416           82.21337     -19.69     <.0001
X9           1      -1436.86982           82.17106     -17.49     <.0001
X10          1         27.42792           82.14183       0.33     0.7399
X11          1       1459.50226           82.12374      17.77     <.0001
date         1          0.22341            0.03245       6.88     <.0001
Trend is 0.22 more accidents per day (1 per 5 days) and is significantly different from 0.
Logistic Regression
Trees seem to be the main tool; logistic regression is another classifier, an older tried-and-true method.
It predicts the probability of response from the input variables (features).
Linear regression gives an infinite range of predictions, but 0 < probability < 1, so ordinary linear regression won't do.
Logistic idea: map p in (0,1) to L on the whole real line using L = ln(p/(1-p)).
Model L as linear in the predictor, e.g. temperature: predicted L = a + b(temperature).
Given temperature X, compute L(X) = a + bX, then p = e^L/(1 + e^L), i.e. p(i) = e^(a+bXi)/(1 + e^(a+bXi)).
Write p(i) if observation i responded, 1 - p(i) if not; multiply all n of these together (the likelihood) and find a, b to maximize it.
[Example output: fitted values -2.6 and 0.23 (intercept a and slope b).]
[Figure: a concordant pair (the responder has the higher predicted probability) and a discordant pair (the responder has the lower predicted probability).]
(Analysis of Maximum Likelihood Estimates: two parameters, intercept and slope, DF 1 each.)

Association of Predicted Probabilities and Observed Responses
Percent Concordant   79.2      Somers' D   0.583
Percent Discordant   20.8      Gamma       0.583
Percent Tied          0.0      Tau-a       0.308
Pairs                  48      c           0.792
Example: Framingham
X=age Y=1 if heart trouble, 0 otherwise
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates
Parameter    DF    Estimate    Standard Error    Wald Chi-Square
Intercept     1    -5.4639         0.5563            96.4711
age           1     0.0630         0.0110            32.6152
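A call of roughly this form could produce output like the above (a sketch; the data set name framingham and response name chd are assumptions, only age is named on the slide):

proc logistic data=framingham descending;   /* descending: model Pr{chd = 1} */
  model chd = age;                          /* ln(p/(1-p)) = a + b*age, fit by maximum likelihood */
run;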
Neural Networks
Very flexible functions; hidden layers ("multilayer perceptron").

[Network diagram: inputs feed hidden-layer basis functions p1, p2, p3, which are combined with weights b1, b2, b3 to produce the output Y.]

Arrows represent linear combinations of basis functions, e.g. logistic curves (hyperbolic tangents).

Example: Y = a + b1 p1 + b2 p2 + b3 p3, e.g. Y = 4 + p1 + 2 p2 - 4 p3
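To make the composition concrete, here is a toy forward pass through three hyperbolic-tangent hidden units; all the hidden-layer weights are made up, and only the output weights 4, 1, 2, -4 come from the example above:

data nnet_toy;
  input x1 x2;
  p1 = tanh( 0.5*x1 - 0.2*x2 + 0.1);   /* hidden-unit basis functions (hyperbolic tangents) */
  p2 = tanh(-0.3*x1 + 0.8*x2 - 0.4);
  p3 = tanh( 0.9*x1 + 0.4*x2 + 0.2);
  Y  = 4 + p1 + 2*p2 - 4*p3;           /* output layer, as in the example above */
datalines;
1 2
0.5 -1
;
proc print data=nnet_toy; run;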
Should always use a holdout sample.
Coefficients are perturbed to optimize the fit on the training (fit) data, using nonlinear search algorithms.
Terminology translation:
Independent variables ↔ Features
Dependent variable ↔ Target
Estimation ↔ Training, Supervised Learning
Clustering ↔ Unsupervised Learning
Prediction ↔ Scoring
Slopes, Betas ↔ Weights (neural nets)
Intercept ↔ Bias (neural nets)
Composition of hyperbolic tangent functions ↔ Neural Network
Radial basis function ↔ Normal density
and my personal favorite:
Type I and Type II errors ↔ Confusion matrix
Association Analysis
Market basket analysis
What they're doing when they scan your VIP card at the grocery store. People who buy diapers tend to also buy _________ (beer?). Just a matter of accounting, but with new terminology (of course). Examples from SAS Appl. DM Techniques, by Sue Walsh:
Terminology
Baskets: ABC, ACD, BCD, ADE, BCE

Rule        Support         Confidence
X => Y      Pr{X and Y}     Pr{Y | X}
A => D        2/5             2/3
C => A        2/5             2/4
B&C => D      1/5             1/3
Don't be Fooled!
Lift = Confidence / Expected Confidence under independence

                      Saving
                   No       Yes     Total
Checking   No       500     1000     1500
Checking   Yes     3500     5000     8500
Total              4000     6000    10000
SVG => CHKG: expect 8500/10000 = 85% if independent. The observed confidence is 5000/6000 = 83%. Lift = 83/85 < 1: savings account holders are actually LESS likely than others to have a checking account!
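The lift arithmetic above can be written out directly (a sketch using the counts from the table):

data svg_chkg;
  n = 10000; n_svg = 6000; n_chkg = 8500; n_both = 5000;
  support    = n_both / n;              /* Pr{SVG and CHKG} = 0.50 */
  confidence = n_both / n_svg;          /* Pr{CHKG | SVG} = 0.833 */
  expected   = n_chkg / n;              /* Pr{CHKG} = 0.85 under independence */
  lift       = confidence / expected;   /* about 0.98 < 1 */
  put support= confidence= expected= lift=;
run;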
Summary
Data mining: a set of fast statistical methods for large data sets. Some ideas are new; many are old or extensions of old ones.
Some methods: trees (recursive splitting), logistic regression, neural networks, association analysis, nearest neighbors, clustering, etc.
TEXT MINING
Hypothetical collection of news releases (corpus):
Release 1: "Did the NCAA investigate the basketball scores and vote for sanctions?"
Release 2: "Republicans voted for and Democrats voted against it for the win."
(etc.)
Compute word counts:
            NCAA  basketball  score  vote  Republican  Democrat  win
Release 1     1       1         1      1       0          0       0
Release 2     0       0         0      2       1          1       1
Term counts for 14 news releases (columns are documents 1-14):

document:   1   2   3   4   5   6   7   8   9  10  11  12  13  14
Election:  20   5   0   8   0  10   2   4  26  19   2  16  14   1
Democrat:   6   4   0  12   0   5   0   2  16   9   1  13  20   3
Voters:     0   2   2  14   0  19   1   4  20  12   3   9  19   6
NCAA:       1   0  12   2  15   5  12   9   6   0  12   0   0   9
Liar:       5   9   0  12   2  20  13   0  24  14   0  16  12   3
Speech:     8  12   4  15   3  18   0   9  30  22  12  12   9   0
Wins:      18  12  24  22   9  13   0   3   9   3  17   0   6   3
Score_V:   15   9  19   8   0   9   1   0  10   1  23   0   1  10
Score_N:   21   0  30   2   1  14   6   0  14   0   8   2   4  20
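These counts can be fed to PROC PRINCOMP to produce the eigenvalue table and the Prin1, Prin2 scores shown next (a sketch; the data set name releases and the id variable document are assumptions):

proc princomp data=releases out=prin;     /* eigenvalues of the correlation matrix; scores Prin1, Prin2, ... */
  var Election Democrat Voters NCAA Liar Speech Wins Score_V Score_N;
run;
proc sgplot data=prin;                    /* scatter of the first two component scores */
  scatter y=Prin1 x=Prin2 / datalabel=document;
run;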
Eigenvalues of the Correlation Matrix
        Eigenvalue     Difference     Proportion    Cumulative
 1      7.10954264     4.80499109       0.5469        0.5469
 2      2.30455155     1.30162837       0.1773        0.7242
 3      1.00292318     0.23404351       0.0771        0.8013
 4      0.76887967     0.21070080       0.0591        0.8605
 5      0.55817886     0.10084923       0.0429        0.9034
 6      0.45732963     0.15563511       0.0352        0.9386
 7      0.30169451     0.13396581       0.0232        0.9618
 8      0.16772870     0.00501411       0.0129        0.9747
 9      0.16271459     0.04345658       0.0125        0.9872
10      0.1192580      0.08890707       0.0092        0.9964
11      0.0303509      0.01437903       0.0023        0.9987
12      0.0159719      0.01509610       0.0012        0.9999
13      0.0008758                       0.0001        1.0000
[Scatter plots of Prin1 versus Prin2 for the 14 releases; the points separate into two groups, labeled Cluster 1 and Cluster 2.]
Plot of Prin1*Prin2, labeled by document number:
[Releases 12, 9, 13, 10, 4, 6, 2, and 1 have the highest Prin1 scores (roughly 0 to 4); releases 8, 7, 14, 5, 11, and 3 have negative Prin1 scores (roughly -1 to -3), with Prin2 on the horizontal axis.]
D.A.D.