Data Mining Tutorial: D. A. Dickey
Trees
A divisive method (splits): start with a root node (all observations in one group) and get splitting rules. The response is often binary; the result is a tree.
Examples: loan defaults, the Framingham Heart Study, automobile fatalities.
Recursive Splitting
[Figure: recursive splitting of the loan-default data on X1 = Debt-to-Income Ratio and X2 = Age; default and no-default points are plotted, and the resulting rectangles have estimated Pr{default} ranging from 0.0001 up to 0.012 (e.g. 0.003, 0.006, 0.007).]
Example of a tree
[Tree for the Framingham Heart Study data: the root node holds all 1615 patients; split #1 is on Age, and a later split uses Systolic BP; the branches end in terminal nodes.]
Options: (1) assessment measure: Avg. Sq. Error; (2) N = 4; (3) Gini splits.
Goal: pure leaves (terminal nodes). Ideal split: everyone with BP > x has problems; nobody with BP < x has problems.
Where to Split?
First, a review of chi-square tests for contingency tables.
DEPENDENT example:
             Heart Disease
             No    Yes    Total
Low BP       95      5      100
High BP      55     45      100

INDEPENDENT example:
             Heart Disease
             No    Yes    Total
Low BP       75     25      100
High BP      75     25      100
Chi-Square (χ²) Test Statistic
If independent, expect 100(150/200) = 75 in the upper left cell (and similarly for the other cells, e.g. 100(50/200) = 25).
χ² = Σ over all cells of (observed - expected)² / expected

             Heart Disease
             No          Yes         Total
Low BP       95 (75)      5 (25)       100
High BP      55 (75)     45 (25)       100
Total       150          50            200

(expected counts under independence in parentheses)
Here χ² = (95-75)²/75 + (5-25)²/25 + (55-75)²/75 + (45-25)²/25 = 5.33 + 16 + 5.33 + 16 = 42.67, far beyond the 3.84 cutoff for a 1-degree-of-freedom chi-square at the 5% level, so BP and heart disease are associated.
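The same test can be run in SAS from the cell counts. A minimal sketch (the data set and variable names bp_hd, bp, hd, count are made up for illustration):

data bp_hd;
  input bp $ hd $ count;
datalines;
Low  No  95
Low  Yes  5
High No  55
High Yes 45
;
proc freq data=bp_hd;
  weight count;                /* each row carries a cell count */
  tables bp*hd / chisq;        /* prints the Pearson chi-square (about 42.7) and its p-value */
run;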
Age 47 maximizes logworth
Idea: pick the BP cutoff that minimizes the p-value for χ². But what does significance mean now?
Multiple testing
With 50 different BPs in the data there are 49 ways to split. Sunday football highlights always look good! If he shoots enough times, even a 95% free-throw shooter will miss. Having tried 49 splits, each one has a 5% chance of declaring significance even if there's no relationship.
α = Pr{falsely reject hypothesis 1}
α = Pr{falsely reject hypothesis 2}
Pr{falsely reject one or the other} < 2α
Desired: probability 0.05 or less.
Solution: use α = 0.05/2, or equivalently compare 2 × (p-value) to 0.05.
With 50 different BPs in the data there are m = 49 ways to split. Multiply the p-value by 49: Bonferroni's original idea, applied to data mining (trees) by Kass. Stop splitting if the minimum adjusted p-value is large. For m splits, the logworth becomes -log10(m × p-value).
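The Kass/Bonferroni adjustment is one line of arithmetic. A small sketch with a made-up raw p-value:

data logworth;
  m = 49;                      /* number of candidate splits */
  raw_p = 0.0008;              /* hypothetical smallest raw p-value among the splits */
  adj_p = min(1, m*raw_p);     /* Bonferroni (Kass) adjusted p-value */
  logworth = -log10(adj_p);    /* keep splitting only if this is large */
  put adj_p= logworth=;
run;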
Gini example: the leaf {A A B C B A A B C C} has class proportions 0.4, 0.3, 0.3, so its diversity is 1 - [0.16 + 0.09 + 0.09] = 0.66: MORE DIVERSE, LESS PURE.
Shannon Entropy
Larger means more diverse (less pure): entropy = -Σi pi log2(pi)
{0.5, 0.4, 0.1}: entropy 1.36 (less diverse)
{0.4, 0.3, 0.3}: entropy 1.57 (more diverse)
* (EM uses sampling with replacement)
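Both diversity measures are short computations. A sketch comparing the two leaves above (variable and data set names are arbitrary):

data diversity;
  input p1-p3;                 /* class proportions in a leaf */
  array p{3} p1-p3;
  entropy = 0; gini = 1;
  do i = 1 to 3;
    if p{i} > 0 then entropy = entropy - p{i}*log2(p{i});  /* Shannon entropy */
    gini = gini - p{i}**2;                                  /* Gini diversity: 1 - sum of squared proportions */
  end;
  drop i;
datalines;
0.5 0.4 0.1
0.4 0.3 0.3
;
proc print data=diversity; run;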
Goals
Split if the diversity in the parent node exceeds the summed diversities in the child nodes. Observations should be homogeneous (not diverse) within leaves and different between leaves; the set of leaves should be diverse.
Validation
Traditional statistics: small data sets, need all observations to estimate the parameters of interest.
Data mining: loads of data, can afford a holdout sample.
Variation: n-fold cross validation. Randomly divide the data into n sets; estimate on n-1 of them and validate on the remaining one; repeat n times, using each set once as the holdout.
Pruning
Grow a bushy tree on the fit (training) data, then classify the holdout data. The farthest-out branches likely do not improve, and may even hurt, the fit on the holdout data, so prune the non-helpful branches. What is helpful? What is a good discriminator criterion?
Goals
Want diversity in the parent node > summed diversities in the child nodes.
Goal: reduce diversity within leaves and maximize differences between leaves.
Use validation: average squared error, proportion of correct decisions, etc.
Costs (profits) may enter the picture for splitting or pruning.
Including Probabilities
Leaf has Pr(M) = 0.7, Pr(F) = 0.3, with this profit matrix:

                       True gender
                       M (0.7)    F (0.3)
You say: "Sir"            2         -1
You say: "Ma'am"        -10          5

Expected profit is 2(0.7) - 1(0.3) = +$1.10 if I say "Sir".
Expected profit is -10(0.7) + 5(0.3) = -7 + 1.5 = -$5.50 (a loss) if I say "Ma'am".
Weight the leaf profits by leaf size (# of observations) and sum; prune (and split) to maximize profits.
Additional Ideas
Forests: draw samples with replacement (bootstrap) and grow multiple trees.
Random Forests: also randomly sample the features (predictors) and build multiple trees.
Classify a new point in each tree, then average the probabilities or take a plurality vote across the trees.
Lift
[Lift chart: lift of about 3.3 in the most responsive leaves.]
* Cumulative lift chart: go from the leaf with the most predicted 1-responses to the least. Lift = (proportion responding in the first p%) / (overall population response rate).
Regression Trees
Continuous response Y. The predicted response Pi is constant within each region i = 1, …, 5.
[Figure: the (X1, X2) plane partitioned into five regions with constant predictions 80, 50, 130, 20, and 100.]
Prediction PREDi in cell i; Yij is the jth response in cell i. Split to minimize Σi Σj (Yij - PREDi)².
[The same five predictions (80, 50, 130, 20, 100) shown as the terminal nodes of the corresponding tree.]
Real data example: traffic accidents in Portugal*. Y = injury-induced cost to society.
* Tree developed by Guilhermina Torrao, (used with permission) NCSU Institute for Transportation Research & Education
If the Life Line is long and deep, then this represents a long life full of vitality and health. A short line, if strong and deep, also shows great vitality in your life and the ability to overcome health problems. However, if the line is short and shallow, then your life may have the tendency to be controlled by others
http://www.ofesite.com/spirit/palm/lines/linelife.htm
Result: Predicted Age at Death = 79.24 - 1.367 (lifeline). (Is this real??? Is this repeatable???)
Simulation: Age at Death = 67 + 0 × (life line) + e, where the error e has a normal distribution with mean 0 and variance 200. Simulate 20 samples with n = 50 bodies each. A sketch of this simulation in SAS follows.
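The seed and the range of lifeline lengths below are arbitrary assumptions; only the intercept 67, the zero slope, and the error variance 200 come from the setup above:

data sim;
  call streaminit(12345);
  do rep = 1 to 20;                                      /* 20 samples */
    do i = 1 to 50;                                      /* n = 50 bodies each */
      line = 5 + 10*rand('uniform');                     /* hypothetical lifeline length */
      age  = 67 + 0*line + rand('normal', 0, sqrt(200)); /* true slope 0, error variance 200 */
      output;
    end;
  end;
run;
proc reg data=sim;
  by rep;                                                /* one fitted line per sample, as in the NOTE below */
  model age = line;
run;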
NOTE: Regression equations:
Age(rep:1)  = 80.56253 - 1.345896*line
Age(rep:2)  = 61.76292 + 0.745289*line
Age(rep:3)  = 72.14366 - 0.546996*line
Age(rep:4)  = 95.85143 - 3.087247*line
Age(rep:5)  = 67.21784 - 0.144763*line
Age(rep:6)  = 71.0178  - 0.332015*line
Age(rep:7)  = 54.9211  + 1.541255*line
Age(rep:8)  = 69.98573 - 0.472335*line
Age(rep:9)  = 85.73131 - 1.240894*line
Age(rep:10) = 59.65101 + 0.548992*line
Age(rep:11) = 59.38712 + 0.995162*line
Age(rep:12) = 72.45697 - 0.649575*line
Age(rep:13) = 78.99126 - 0.866334*line
Age(rep:14) = 45.88373 + 2.283475*line
Age(rep:15) = 59.28049 + 0.790884*line
Age(rep:16) = 73.6395  - 0.814287*line
Age(rep:17) = 70.57868 - 0.799404*line
Age(rep:18) = 72.91134 - 0.821219*line
Age(rep:19) = 55.46755 + 1.238873*line
Age(rep:20) = 63.82712 + 0.776548*line
So Predicted Age at Death = 79.24 - 1.367 (lifeline) would NOT be unusual even if there is no true relationship.
Distribution of t Under H0
Conclusion: estimated slopes vary.
The (estimated) standard deviation of the sample slopes is the standard error.
Compute t = (estimate - hypothesized value)/standard error.
The p-value is the probability of a larger |t| when the hypothesis is correct (e.g. slope = 0); it is the sum of the two tail areas.
Traditionally p < 0.05 implies the hypothesized value is wrong; p > 0.05 is inconclusive.
t Value    Pr > |t|
  5.34     <.0001
 -0.86     0.3965

[Figure: the t distribution under H0, with the two tails beyond -0.86 and +0.86 shaded; each tail area is 0.19825, and their sum 0.39650 is the p-value.]

H0: true slope is 0 (no association)
H1: true slope is not 0
p = 0.3965
Simulation: Age at Death = 67 + 0 × (life line) + e, error e normal with mean 0 and variance 200. WHY variance 200? Simulate 20 samples with n = 50 bodies each.
We want an estimate of the variability around the true line. The true variance is σ² (200 in the simulation). Use sums of squared residuals (SS).
The sum of squared residuals from the mean is SS(total) = 9755; the sum of squared residuals around the line is SS(error) = 9609.
(1) SS(total) - SS(error) is SS(model) = 146.
(2) The variance estimate is SS(error)/(error degrees of freedom) = 9609/48 ≈ 200.
(3) SS(model)/SS(total) is R², the proportion of variability explained by the model (here 146.5/9755 = 0.0150).
Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Value   Pr > F
Model     1      146.51753      146.51753      0.73     0.3965
Error    48     9608.70247      200.18130
Total    49     9755.22000

R-Square: 0.0150
The 2nd Martian gives the first piece of information (DF) about the error variance around the mean.

[Figure: scatter plot of Martian Weight versus Martian Height.]

[Figure: a plane fit over inputs X1 and X2, with the vertical gap from a point to the plane labeled "error".]

The fourth leg gives the first chance to measure error (the first error DF). Fitting a plane: 2 model DF, n - 3 = 37 error DF, n - 1 = 39 total DF (DF column: 2, 37, 39).
Extension:
Multiple Regression
Issues:
(1) Testing joint importance versus individual significance: a two-engine plane can still fly if engine #1 fails, and it can still fly if engine #2 fails, so neither engine is critical individually, yet they are jointly critical (can't omit both!).
(2) Prediction versus modeling individual effects.
(3) Collinearity (correlation among inputs).
Example: a hypothetical company's sales Y depend on TV advertising X1 and radio advertising X2:
Y = β0 + β1 X1 + β2 X2 + e
Data Sales;
  length sval $8;
  length cval $8;
  input store TV radio sales;
  (more code)
cards;
 1 869 868  9089
 2 836 820  8290
(more data)
40 969 961 10130
;
Conclusion: Can predict well with just TV, just radio, or both! SAS code: proc reg data=next; model sales = TV radio;
Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Value
Model     2       32660996        16330498    358.84
Error    37        1683844           45509
Total    39       34344840

Root MSE: 213.32908     R-Square: 0.9510 (= 32660996/34344840)

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   Pr > |t|
Intercept    1        531.11390          359.90429      0.1485
TV           1          5.00435            5.01845      0.3251   (can omit TV)
radio        1          4.66752            4.94312      0.3512   (can omit radio)
Estimated Sales = 531 + 5.0 TV + 4.7 radio, with error variance 45509 (standard deviation 213). TV spending is approximately equal to radio spending, so approximately Estimated Sales = 531 + 9.7 TV.
Summary:
Good predictions are given by
  Sales = 531 + 5.0 x TV + 4.7 x Radio, or
  Sales = 479 + 9.7 x TV, or
  Sales = 612 + 9.6 x Radio, or
  (lots of others).
Multicollinearity can be diagnosed by looking at principal components (axes of variation):
the variance along the PC axes = the eigenvalues of the correlation matrix;
the directions the axes point = the eigenvectors of the correlation matrix.

Pearson correlations (p-value beneath each correlation):

            TV         radio      sales
TV        1.00000     0.99737    0.97457
                      <.0001     <.0001
radio     0.99737     1.00000    0.97450
          <.0001                 <.0001
sales     0.97457     0.97450    1.00000
          <.0001      <.0001

[Figure: scatter plot of TV $ versus Radio $; the points fall almost exactly along Principal Component Axis 1.]
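A sketch of how the correlations and principal components above could be produced, assuming the advertising data set from the earlier data step (called Sales there):

proc corr data=Sales;               /* pairwise correlations of TV, radio, sales */
  var TV radio sales;
run;
proc princomp data=Sales out=pc;    /* eigenvalues and eigenvectors of the correlation matrix of the inputs */
  var TV radio;
run;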
TEXT MINING
Hypothetical collection of e-mails (corpus) from analytics students:
John, message 1: "There's a good cook there."
Susan, message 1: "I have an analytics practicum then."
Susan, message 2: "I'll be late from analytics."
John, message 2: "Shall we take the kids to a movie?"
John, message 3: "Later we can eat what I cooked yesterday."
(etc.)
Compute word counts:
         analytics  cook_n  cook_v  kids  late  movie  practicum
John         0         1       1      1     1     1        0
Susan        2         0       0      0     1     0        1
Term counts for 16 students (columns are students 1-16):

student:     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
Job:         5   5   0   8   0  10   1   2   4  26  19   2  16  14   1   3
Practicum:   8   6   2   9   0   6   0   3   1  13  22   0  19  17   0   5
Analytics:  10   9   0   7   4   9   1   1   4   9  10   0  21  12   4   8
Movie:      12   5  14   0  16   5   6  13  16   2  11  14   0   0  21   0
Data:        6   4   0  12   0   5   2   0   2  16   9   1  13  20   3   1
SAS:         0   2   2  14   0  19   1   1   4  20  12   3   9  19   6   2
Kids:        1   0  12   2  15   5   9  12   9   6   0  12   0   0   9   0
Miner:       5   9   0  12   2  20   0  13   0  24  14   0  16  12   3   5
Interview:   8  12   4  15   3  18   0   0   9  30  22  12  12   9   0   4
Late:       18  12  24  22   9  13   2   0   3   9   3  17   0   6   3   6
Cook_v:      5   1  18   0  18   8   6  12   0   7   2  14   0   3   9   1
Cook_n:      0   0   4   0   9   1   0   1   0   2   0   3   0   0   3   0
Documents reordered by cluster (CLUSTER), with the first principal component score (Prin1) and term counts:

document:      1       2       4       6      10      11      13      14      16        3        5        7        8        9       12       15
CLUSTER:       1       1       1       1       1       1       1       1       1        2        2        2        2        2        2        2
Prin1:    0.15311 0.93370 2.08576 1.74995 3.70319 2.76166 3.77000 3.37595 0.44444 -3.62271 -4.18243 -1.90553 -2.54416 -1.41349 -2.98274 -2.32671
Job:           5       5       8      10      26      19      16      14       3        0        0        1        2        4        2        1
Practicum:     8       6       9       6      13      22      19      17       5        2        0        0        3        1        0        0
Analytics:    10       9       7       9       9      10      21      12       8        0        4        1        1        4        0        4
Movie:        12       5       0       5       2      11       0       0       0       14       16        6       13       16       14       21
Data:          6       4      12       5      16       9      13      20       1        0        0        2        0        2        1        3
SAS:           0       2      14      19      20      12       9      19       2        2        0        1        1        4        3        6
Kids:          1       0       2       5       6       0       0       0       0       12       15        9       12        9       12        9
Miner:         5       9      12      20      24      14      16      12       5        0        2        0       13        0        0        3
Grocery_list:  3       0       3       0       4      10       4       5       0       16       17       10       20       12       16        8
Interview:     8      12      15      18      30      22      12       9       4        4        3        0        0        9       12        0
Late:         18      12      22      13       9       3       0       6       6       24        9        2        0        3       17        3
Cook_v:        5       1       0       8       7       2       0       3       1       18       18        6       12        0       14        9
Cook_n:        0       0       0       1       2       0       0       0       0        4        9        0        1        0        3        3
Unsupervised Learning
We have the features (predictors). We do NOT have the response, even on a training data set (UNsupervised). Clustering:
Agglomerative: start with each point separated.
Divisive: start with all points in one cluster, then split.
Direct: state the number of clusters beforehand.
EM: PROC FASTCLUS
Step 1: find (50) seeds as separated as possible. Step 2: cluster points to the nearest seed.
Drift: as points are added, move the seed (centroid) to the average of each coordinate. Alternatively, make a full pass, then recompute the seeds and iterate.
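A minimal PROC FASTCLUS sketch of the two-step idea above (the data set name mydata and inputs x1, x2 are placeholders):

proc fastclus data=mydata maxclusters=50 maxiter=10 out=clusout;
  var x1 x2;   /* Step 1: FASTCLUS picks well-separated seeds; Step 2: points go to the nearest seed and the seeds drift toward the cluster means */
run;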
Clusters as Created
Data tests;
  input IQ Study_Time Grade;
  IQ_S = IQ*Study_Time;
cards;
105 10 75
110 12 79
120  6 68
116 13 85
122 16 91
130  8 79
114 20 98
102 15 76
;
Proc reg data=tests; model Grade = IQ;
Proc reg data=tests; model Grade = IQ Study_Time;
Model Grade = IQ:
Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept     1        62.57113           48.24164        1.30     0.2423
IQ            1         0.16369            0.41877        0.39     0.7094

Model Grade = IQ Study_Time:
Variable     DF   Parameter Estimate   Standard Error
Intercept     1         0.73655           16.26280
IQ            1         0.47308            0.12998
Study_Time    1         2.10344            0.26418
Contrast: TV advertising loses significance when radio is added; IQ gains significance when study time is added.
Model for Grades: Predicted Grade = 0.74 + 0.47 x IQ + 2.10 x Study Time
Question: Does an extra hour of study really deliver 2.10 points for everyone regardless of IQ? Current model only allows this.
Analysis of Variance (interaction model)
Source   DF   F Value   Pr > F
Model     3    26.22    0.0043
Error     4
Total     7

R-Square: 0.9516
(Parameter Estimates table: 4 rows, Intercept, IQ, Study_Time, IQ*Study_Time, each with DF 1.)
Interaction model: Predicted Grade = 72.21 - 0.13 x IQ - 4.11 x Study Time + 0.053 x IQ x Study Time
 = (72.21 - 0.13 x IQ) + (-4.11 + 0.053 x IQ) x Study Time
IQ = 102 predicts Grade = (72.21 - 13.26) + (5.41 - 4.11) x Study Time = 58.95 + 1.30 x Study Time
IQ = 122 predicts Grade = (72.21 - 15.86) + (6.47 - 4.11) x Study Time = 56.35 + 2.36 x Study Time
[Figure: predicted Grade versus Study Time, one line for IQ = 122 (slope = 2.36) and one for IQ = 102 (slope = 1.30).]
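The interaction fit above can be obtained with a third PROC REG call, reusing the IQ_S = IQ*Study_Time variable created in the earlier data step (a sketch; the output itself is not reproduced here):

Proc reg data=tests; model Grade = IQ Study_Time IQ_S;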
Adding the interaction makes everything insignificant (individually)! Do we need to omit insignificant terms until only significant ones remain? Has an acquitted defendant proved his innocence? Common sense trumps statistics!
Classification Variables (dummy variables, indicator variables)
Model for NC crashes involving deer: Proc reg data=deer; model deer = X11;
X11 is 1 in November, 0 elsewhere. (A sketch of how such a dummy can be built from a date variable follows.)
Predicted Accidents = 1181 + 2579 X11
Interpretation: in November, predict 1181 + 2579(1) = 3660; in any other month, predict 1181 + 2579(0) = 1181.
1181 is the average of the other months; 2579 is the added November effect (vs. the average of the others).
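One way the month dummies could be built from a SAS date variable (a sketch; it assumes the deer data set carries a variable named date holding SAS dates, as in the listing further below):

data deer; set deer;
  X10 = (month(date) = 10);   /* 1 in October, 0 otherwise */
  X11 = (month(date) = 11);   /* 1 in November, 0 otherwise */
  X12 = (month(date) = 12);   /* 1 in December, 0 otherwise */
run;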
Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Value   Pr > F
Model     1       30473250       30473250     90.45    <.0001
Error    58       19539666         336891
Total    59       50012916

Root MSE: 580.42294     R-Square: 0.6093 (= 30473250/50012916)

(Parameter estimates: Intercept and X11, DF 1 each, giving the 1181 and 2579 used above.)
Looks like December and October need dummies too! Proc reg data=deer; model deer = X10 X11 X12;
Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Value   Pr > F
Model     3       46152434       15384145    223.16    <.0001
Error    56        3860482          68937
Total    59       50012916

Root MSE: 262.55890     R-Square: 0.9228

Parameter Estimates
Variable    DF   Parameter Estimate
Intercept    1        929.40000
X10          1       1391.20000
X11          1       2830.20000
X12          1       1377.40000

[Data listing: months JAN03 through DEC04 with dummy columns x10, x11, x12 equal to 1 in October, November, and December respectively, 0 in every other month.]
Average of Jan through Sept. is 929 crashes per month. Add 1391 in October, 2830 in November, 1377 in December.
What the heck, let's do all but one (we need an "average of the rest" baseline, so at least one month must be left out): Proc reg data=deer; model deer = X1 X2 … X10 X11;
Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Value   Pr > F
Model    11       48421690        4401972    132.79    <.0001
Error    48        1591226          33151
Total    59       50012916

Root MSE: 182.07290     R-Square: 0.9682

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1        2306.80000         81.42548       28.33    <.0001
X1           1        -885.80000        115.15301       -7.69    <.0001
X2           1       -1181.40000        115.15301      -10.26    <.0001
X3           1       -1220.20000        115.15301      -10.60    <.0001
X4           1       -1486.80000        115.15301      -12.91    <.0001
X5           1       -1526.80000        115.15301      -13.26    <.0001
X6           1       -1433.00000        115.15301      -12.44    <.0001
X7           1       -1559.20000        115.15301      -13.54    <.0001
X8           1       -1646.20000        115.15301      -14.30    <.0001
X9           1       -1457.20000        115.15301      -12.65    <.0001
X10          1          13.80000        115.15301        0.12    0.9051
X11          1        1452.80000        115.15301       12.62    <.0001
Average of rest is just December mean 2307. Subtract 886 in January, add 1452 in November. October (X10) is not significantly different than December.
[Residual plot: runs of positive and negative residuals over time suggest adding a trend term.]
Add date (days since Jan 1, 1960 in SAS) to capture the trend: Proc reg data=deer; model deer = date X1 X2 … X10 X11;
Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Value   Pr > F
Model    12       49220571        4101714    243.30    <.0001
Error    47         792345          16858
Total    59       50012916

Root MSE: 129.83992     R-Square: 0.9842

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1      -1439.94000          547.36656      -2.63     0.0115
X1           1       -811.13686           82.83115      -9.79     <.0001
X2           1      -1113.66253           82.70543     -13.47     <.0001
X3           1      -1158.76265           82.60154     -14.03     <.0001
X4           1      -1432.28832           82.49890     -17.36     <.0001
X5           1      -1478.99057           82.41114     -17.95     <.0001
X6           1      -1392.11624           82.33246     -16.91     <.0001
X7           1      -1525.01849           82.26796     -18.54     <.0001
X8           1      -1618.94416           82.21337     -19.69     <.0001
X9           1      -1436.86982           82.17106     -17.49     <.0001
X10          1         27.42792           82.14183       0.33     0.7399
X11          1       1459.50226           82.12374      17.77     <.0001
date         1          0.22341            0.03245       6.88     <.0001
Trend is 0.22 more accidents per day (1 per 5 days) and is significantly different from 0.
Logistic Regression
Trees seem to be the main tool; logistic regression is another classifier, an older tried-and-true method.
It predicts the probability of response from the input variables (features).
Linear regression gives an infinite range of predictions, but 0 < probability < 1, so ordinary linear regression won't do.
Logistic idea: map p in (0,1) to L on the whole real line using L = ln(p/(1-p)).
Model L as linear in the predictor, e.g. temperature: predicted L = a + b(temperature).
Given temperature X, compute L(X) = a + bX, then p = e^L/(1 + e^L), i.e. p(i) = e^(a+bXi)/(1 + e^(a+bXi)).
Write p(i) if observation i responded, 1 - p(i) if not; multiply all n of these together (the likelihood) and find a, b to maximize it.
[Example output: fitted values -2.6 and 0.23 (intercept a and slope b).]
[Figure: a concordant pair (the responder has the higher predicted probability) and a discordant pair (the responder has the lower predicted probability).]
(Analysis of Maximum Likelihood Estimates: two parameters, intercept and slope, DF 1 each.)

Association of Predicted Probabilities and Observed Responses
Percent Concordant   79.2      Somers' D   0.583
Percent Discordant   20.8      Gamma       0.583
Percent Tied          0.0      Tau-a       0.308
Pairs                  48      c           0.792
Example: Framingham
X=age Y=1 if heart trouble, 0 otherwise
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates
Parameter    DF    Estimate    Standard Error    Wald Chi-Square
Intercept     1    -5.4639         0.5563            96.4711
age           1     0.0630         0.0110            32.6152
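A call of roughly this form could produce output like the above (a sketch; the data set name framingham and response name chd are assumptions, only age is named on the slide):

proc logistic data=framingham descending;   /* descending: model Pr{chd = 1} */
  model chd = age;                          /* ln(p/(1-p)) = a + b*age, fit by maximum likelihood */
run;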
Neural Networks
Very flexible functions; hidden layers ("multilayer perceptron").

[Network diagram: inputs feed hidden-layer basis functions p1, p2, p3, which are combined with weights b1, b2, b3 to produce the output Y.]

Arrows represent linear combinations of basis functions, e.g. logistic curves (hyperbolic tangents).

Example: Y = a + b1 p1 + b2 p2 + b3 p3, e.g. Y = 4 + p1 + 2 p2 - 4 p3
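To make the composition concrete, here is a toy forward pass through three hyperbolic-tangent hidden units; all the hidden-layer weights are made up, and only the output weights 4, 1, 2, -4 come from the example above:

data nnet_toy;
  input x1 x2;
  p1 = tanh( 0.5*x1 - 0.2*x2 + 0.1);   /* hidden-unit basis functions (hyperbolic tangents) */
  p2 = tanh(-0.3*x1 + 0.8*x2 - 0.4);
  p3 = tanh( 0.9*x1 + 0.4*x2 + 0.2);
  Y  = 4 + p1 + 2*p2 - 4*p3;           /* output layer, as in the example above */
datalines;
1 2
0.5 -1
;
proc print data=nnet_toy; run;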
Should always use a holdout sample.
Coefficients are perturbed to optimize the fit on the training (fit) data, using nonlinear search algorithms.
Terminology translation:
Independent variables ↔ Features
Dependent variable ↔ Target
Estimation ↔ Training, Supervised Learning
Clustering ↔ Unsupervised Learning
Prediction ↔ Scoring
Slopes, Betas ↔ Weights (neural nets)
Intercept ↔ Bias (neural nets)
Composition of hyperbolic tangent functions ↔ Neural Network
Radial basis function ↔ Normal density
and my personal favorite:
Type I and Type II errors ↔ Confusion matrix
Association Analysis
Market basket analysis
What they're doing when they scan your VIP card at the grocery store. People who buy diapers tend to also buy _________ (beer?). Just a matter of accounting, but with new terminology (of course). Examples from SAS Appl. DM Techniques, by Sue Walsh:
Terminology
Baskets: ABC, ACD, BCD, ADE, BCE

Rule        Support         Confidence
X => Y      Pr{X and Y}     Pr{Y | X}
A => D        2/5             2/3
C => A        2/5             2/4
B&C => D      1/5             1/3
Don't be Fooled!
Lift = Confidence / Expected Confidence under independence

                      Saving
                   No       Yes     Total
Checking   No       500     1000     1500
Checking   Yes     3500     5000     8500
Total              4000     6000    10000
SVG => CHKG: expect 8500/10000 = 85% if independent. The observed confidence is 5000/6000 = 83%. Lift = 83/85 < 1: savings account holders are actually LESS likely than others to have a checking account!
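The lift arithmetic above can be written out directly (a sketch using the counts from the table):

data svg_chkg;
  n = 10000; n_svg = 6000; n_chkg = 8500; n_both = 5000;
  support    = n_both / n;              /* Pr{SVG and CHKG} = 0.50 */
  confidence = n_both / n_svg;          /* Pr{CHKG | SVG} = 0.833 */
  expected   = n_chkg / n;              /* Pr{CHKG} = 0.85 under independence */
  lift       = confidence / expected;   /* about 0.98 < 1 */
  put support= confidence= expected= lift=;
run;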
Summary
Data mining: a set of fast statistical methods for large data sets. Some ideas are new; many are old or extensions of old ones.
Some methods: trees (recursive splitting), logistic regression, neural networks, association analysis, nearest neighbors, clustering, etc.
TEXT MINING
Hypothetical collection of news releases (corpus):
Release 1: "Did the NCAA investigate the basketball scores and vote for sanctions?"
Release 2: "Republicans voted for and Democrats voted against it for the win."
(etc.)
Compute word counts:
            NCAA  basketball  score  vote  Republican  Democrat  win
Release 1     1       1         1      1       0          0       0
Release 2     0       0         0      2       1          1       1
Term counts for 14 news releases (columns are documents 1-14):

document:   1   2   3   4   5   6   7   8   9  10  11  12  13  14
Election:  20   5   0   8   0  10   2   4  26  19   2  16  14   1
Democrat:   6   4   0  12   0   5   0   2  16   9   1  13  20   3
Voters:     0   2   2  14   0  19   1   4  20  12   3   9  19   6
NCAA:       1   0  12   2  15   5  12   9   6   0  12   0   0   9
Liar:       5   9   0  12   2  20  13   0  24  14   0  16  12   3
Speech:     8  12   4  15   3  18   0   9  30  22  12  12   9   0
Wins:      18  12  24  22   9  13   0   3   9   3  17   0   6   3
Score_V:   15   9  19   8   0   9   1   0  10   1  23   0   1  10
Score_N:   21   0  30   2   1  14   6   0  14   0   8   2   4  20
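These counts can be fed to PROC PRINCOMP to produce the eigenvalue table and the Prin1, Prin2 scores shown next (a sketch; the data set name releases and the id variable document are assumptions):

proc princomp data=releases out=prin;     /* eigenvalues of the correlation matrix; scores Prin1, Prin2, ... */
  var Election Democrat Voters NCAA Liar Speech Wins Score_V Score_N;
run;
proc sgplot data=prin;                    /* scatter of the first two component scores */
  scatter y=Prin1 x=Prin2 / datalabel=document;
run;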
Eigenvalues of the Correlation Matrix
        Eigenvalue     Difference     Proportion    Cumulative
 1      7.10954264     4.80499109       0.5469        0.5469
 2      2.30455155     1.30162837       0.1773        0.7242
 3      1.00292318     0.23404351       0.0771        0.8013
 4      0.76887967     0.21070080       0.0591        0.8605
 5      0.55817886     0.10084923       0.0429        0.9034
 6      0.45732963     0.15563511       0.0352        0.9386
 7      0.30169451     0.13396581       0.0232        0.9618
 8      0.16772870     0.00501411       0.0129        0.9747
 9      0.16271459     0.04345658       0.0125        0.9872
10      0.1192580      0.08890707       0.0092        0.9964
11      0.0303509      0.01437903       0.0023        0.9987
12      0.0159719      0.01509610       0.0012        0.9999
13      0.0008758                       0.0001        1.0000
[Scatter plots of Prin1 versus Prin2 for the 14 releases; the points separate into two groups, labeled Cluster 1 and Cluster 2.]
Plot of Prin1*Prin2, labeled by document number:
[Releases 12, 9, 13, 10, 4, 6, 2, and 1 have the highest Prin1 scores (roughly 0 to 4); releases 8, 7, 14, 5, 11, and 3 have negative Prin1 scores (roughly -1 to -3), with Prin2 on the horizontal axis.]
D.A.D.