Applied Data Analysis (With SPSS)
March 2011
Prof. Dr. Jürg Schwarz
juerg.schwarz@hslu.ch
Slide 2
Contents
Aims .... 5
Introduction .... 6
Outline .... 9
Concepts of Cluster Analysis .... 10
Cluster Analysis with SPSS: A detailed example .... 24
Slide 3
Table of contents
Aims .... 5
  Aims of the lecture .... 5
Introduction .... 6
  Example .... 6
Outline .... 9
Concepts of Cluster Analysis .... 10
  Key steps in using a cluster analysis .... 10
  How to measure proximity .... 11
  Proximity measure with interval variables .... 13
  Proximity measure with binary variables .... 15
  How to form clusters .... 18
    How to define similarity? .... 18
    Cluster formation tree (rules for cluster formation) .... 20
    Pros and cons .... 21
    Example of hierarchical method: Single linkage (nearest neighbor) .... 22
    Example of hierarchical method: Complete linkage (furthest neighbor) .... 23
Slide 4
Cluster Analysis with SPSS: A detailed example .... 24
  Marketing research: Customer survey on brand awareness .... 24
  SPSS Elements: <Analyze><Classify><Hierarchical .... 25
  First step: Measure of distance or similarity between objects .... 27
    Output .... 27
Slide 5
Aims
Aims of the lecture
You know different types of measures of distance / similarity
Slide 6
Introduction
Example
Marketing research: Customer survey on brand awareness ("Markenbewusstsein")
Survey features
Brand awareness [Index]
Slide 7
Question
Is there a linear relation between brand awareness and yearly income?
Hypothesis: The higher a person's income, the higher his/her brand awareness.
Slide 8
Question
Is there structure in the brand awareness dataset?
Are there clusters for the combination of yearly income and brand awareness?
Slide 9
Outline
Cluster analysis is a multivariate procedure for detecting natural groupings in data.
The grouping is based on the scores of several measures (e.g. income and awareness).
Slide 10
Slide 11
[Figure: the raw data matrix (objects 1 to k in rows, variables 1 to j in columns) is converted into a proximity matrix (objects in rows and in columns) that holds the distance or similarity of each pair of objects]
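The step from raw data to a proximity matrix can be sketched in a few lines of Python. This is only an illustration: the helper names `euclidean` and `proximity_matrix` and the three data rows are hypothetical, and the Euclidean distance is just one possible choice of measure.

```python
from math import sqrt

def euclidean(x, y):
    """Euclidean distance between two equally long numeric vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def proximity_matrix(raw):
    """Turn a k x j raw data matrix into a k x k distance matrix."""
    return [[euclidean(row_k, row_l) for row_l in raw] for row_k in raw]

# Hypothetical raw data: 3 objects (rows) measured on 2 variables (columns).
raw_data = [
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
]

distances = proximity_matrix(raw_data)
for row in distances:
    print(["%.1f" % value for value in row])
```

The result is symmetric with zeros on the diagonal, so only the lower (or upper) triangle carries information.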
Slide 12
Slide 13
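[Figure: persons 001 and 002 as two points in the plane; the legs a and b and the hypotenuse c form a right triangle with its right angle at {1.67, 1.73}]

Pythagorean theorem:

$$a^2 + b^2 = c^2 \;\Rightarrow\; c = \sqrt{a^2 + b^2}$$

Distance between persons 001 and 002:

$$d_{001,002} = \left[(1.67 - 0.97)^2 + (2.95 - 1.73)^2\right]^{1/2} = \left[0.490 + 1.488\right]^{1/2} = 1.407$$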
Slide 14
Generalized equation
Minkowski distance (Hermann Minkowski, 1864 - 1909, German mathematician)
$$d_{k,l} = \left[\sum_{j=1}^{J} \left|x_{kj} - x_{lj}\right|^{r}\right]^{1/r}$$
r = Minkowski's constant
$d_{k,l}$ = Distance between objects k and l (e.g. distance between persons 001 and 002)
J = Number of cluster variables (e.g. variables income and awareness)
$x_{kj}$, $x_{lj}$ = Values of variable j for objects k and l (e.g. income of persons 001 and 002)
Values of Minkowski's constant
r = 1: City-block distance (L1 norm)
r = 2: Euclidean distance (L2 norm)
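As a minimal sketch, the Minkowski distance can be written as a single Python function. The function name `minkowski` is an assumption for illustration; the two value pairs are the ones used in the distance calculation for persons 001 and 002.

```python
def minkowski(x_k, x_l, r):
    """Minkowski distance between objects k and l for a constant r >= 1."""
    return sum(abs(a - b) ** r for a, b in zip(x_k, x_l)) ** (1.0 / r)

# Values of the two cluster variables for persons 001 and 002.
person_001 = (1.67, 2.95)
person_002 = (0.97, 1.73)

city_block = minkowski(person_001, person_002, r=1)  # sum of absolute differences
euclid = minkowski(person_001, person_002, r=2)      # straight-line distance

print(round(city_block, 3), round(euclid, 3))
```

With r = 2 the function reproduces the Euclidean result of about 1.407; with r = 1 the differences 0.70 and 1.22 are simply added.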
Slide 15
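Configuration of two cars (1 = feature present, 0 = feature absent)

Case       ABS  Airbag  ESP  Navi  Metallic
Mercedes    0     1      1    1      0
BMW         0     1      1    0      1

4 cases per feature
a = feature present in both cars
b, c = feature present in only one of the two cars
d = feature absent in both cars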
Slide 16
$$S_{ij} = \frac{a + \delta_1 d}{a + \lambda (b + c) + \delta_2 d}$$
Variants

Description        Definition
Russell and Rao    $S_{ij} = \frac{a}{a + b + c + d}$
Simple matching*   $S_{ij} = \frac{a + d}{a + b + c + d}$
Dice               $S_{ij} = \frac{2a}{2a + b + c}$
*Sokal, R.R. and Michener, C.D., A statistical method for evaluating systematic relationships, University of Kansas Science Bulletin, 38:1409-1438, 1958.
Slide 17
Configuration (1 = feature present)

Case       ABS  Airbag  ESP  Navi  Metallic
Mercedes    0     1      1    1      0
BMW         0     1      1    0      1

Count of cases: a = 2, b = 1, c = 1, d = 1
Measure            Proximity
Russell and Rao    $S_{ij} = \frac{2}{2 + 1 + 1 + 1} = \frac{2}{5} = 0.4$
Simple matching    $S_{ij} = \frac{2 + 1}{2 + 1 + 1 + 1} = \frac{3}{5} = 0.6$
Dice               $S_{ij} = \frac{2 \cdot 2}{2 \cdot 2 + 1 + 1} = \frac{4}{6} = 0.67$
Some remarks
Sij varies between 0 and 1
There is no "right" proximity measure
Important question/decision:
Is non-existence important?
(<=> taking case d into account?)
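The three coefficients can be checked with a short Python sketch. The function names are hypothetical, the coefficient a / (a + b + c + d) is referred to here by its common name, Russell and Rao, and the two feature vectors follow the order ABS, Airbag, ESP, Navi, Metallic from the configuration table.

```python
def count_cases(x, y):
    """Count the four cases a, b, c, d for two binary feature vectors."""
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)  # present in both
    b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)  # present only in x
    c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)  # present only in y
    d = sum(1 for i, j in zip(x, y) if i == 0 and j == 0)  # absent in both
    return a, b, c, d

def russell_rao(a, b, c, d):
    return a / (a + b + c + d)

def simple_matching(a, b, c, d):
    return (a + d) / (a + b + c + d)

def dice(a, b, c, d):
    # d is accepted for a uniform signature but deliberately ignored.
    return 2 * a / (2 * a + b + c)

# Feature order: ABS, Airbag, ESP, Navi, Metallic.
mercedes = [0, 1, 1, 1, 0]
bmw = [0, 1, 1, 0, 1]

cases = count_cases(mercedes, bmw)
print(cases)
print(round(russell_rao(*cases), 2),
      round(simple_matching(*cases), 2),
      round(dice(*cases), 2))
```

Only the simple matching coefficient counts joint absence (case d) as agreement, which is exactly the decision raised in the remarks.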
Slide 18
[Figure: three numbered options (1. to 3.) for defining the similarity to Cluster B]
Slide 19
$$d_1^2 + d_2^2 + \dots = \sum_{i=1} d_i^2$$
Slide 20
Hierarchical
  Agglomerative
    Linkage methods
      Single linkage
      Complete linkage
      Average linkage
      Other linkage
    Variance methods
      Ward's procedure
  Divisive
Non-hierarchical
  k-Means procedure
Slide 21
Hierarchical clustering
No a priori decision about the number of clusters
Can be very slow
Non-hierarchical clustering
Need to specify the number of clusters (can be an arbitrary number)
Faster, more reliable
Features
Procedure          Proximity measure        Remark
Single linkage     distance or similarity
Complete linkage   distance or similarity
Average linkage    distance or similarity
Other linkage      only distance            No remark
Ward's method      only distance
Slide 22
[Figure: single linkage (nearest neighbor) at step k and step k + 1; the clusters tend to form a "chain"]
Cognitive Psychology Unit at Saarland University (www.uni-saarland.de) (Date of access: March, 2011)
Slide 23
[Figure: complete linkage (furthest neighbor) at step k and step k + 1]
Cognitive Psychology Unit at Saarland University (www.uni-saarland.de) (Date of access: March, 2011)
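The difference between the two linkage rules can be sketched as follows. The function names and the two small clusters are hypothetical, and the Euclidean distance (`math.dist`, Python 3.8 or later) is assumed as the proximity measure.

```python
from math import dist  # Euclidean distance between two points, Python 3.8+

def single_linkage(cluster_a, cluster_b):
    """Nearest neighbor: smallest distance between any two members."""
    return min(dist(p, q) for p in cluster_a for q in cluster_b)

def complete_linkage(cluster_a, cluster_b):
    """Furthest neighbor: largest distance between any two members."""
    return max(dist(p, q) for p in cluster_a for q in cluster_b)

# Hypothetical clusters of two points each, all on one line.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

nearest = single_linkage(cluster_a, cluster_b)
furthest = complete_linkage(cluster_a, cluster_b)
print(nearest, furthest)
```

Because single linkage only looks at the closest pair, one point lying between two groups is enough to connect them, which is one way the chaining effect can arise.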
Slide 24
Data
Brand awareness [Index]
Random sub-sample of n = 15
(Why this small sub-sample?
Just to keep track of what SPSS does.)
Slide 25
Slide 26
Syntax
Variables included
/METHOD BAVERAGE
/MEASURE= EUCLID
/ID=person
/PRINT SCHEDULE
/PRINT DISTANCE
Slide 27
Example:
Distance between cases 9 and 7
:
:
Slide 28
Between-groups linkage
Stage 1: Cases 7 and 9 have the smallest distance ("Coefficients" = .203) => first cluster {7,9}
The first cluster {7,9} is merged with case 10 in stage 5 => cluster {7,9,10}
Stage 2: Cases 13 and 14 have the second smallest distance => second cluster {13,14}
The second cluster {13,14} is merged with case 11 in stage 3 => cluster {11,13,14}
:
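How such an agglomeration schedule comes about can be sketched in plain Python. This is not the SPSS implementation: the function names, the 0-based case indices and the five data points are assumptions for illustration, and between-groups (average) linkage with Euclidean distance is computed directly from the raw points.

```python
from itertools import combinations
from math import dist

def average_linkage(c1, c2, points):
    """Mean of all pairwise distances between members of two clusters."""
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(dist(points[i], points[j]) for i, j in pairs) / len(pairs)

def agglomeration_schedule(points):
    """Return a list of (stage, cluster_1, cluster_2, coefficient) tuples."""
    clusters = [(i,) for i in range(len(points))]  # every case starts alone
    schedule = []
    stage = 1
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], points),
        )
        coefficient = average_linkage(c1, c2, points)
        schedule.append((stage, c1, c2, coefficient))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(tuple(sorted(c1 + c2)))
        stage += 1
    return schedule

# Hypothetical data: five cases on two variables.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.5), (10.0, 0.0)]
schedule = agglomeration_schedule(points)
for stage, c1, c2, coefficient in schedule:
    print(stage, c1, c2, round(coefficient, 3))
```

Each row plays the role of one stage in the schedule: the two clusters combined and the coefficient at which they were merged.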
Slide 29
Dendrogram
[Figure: dendrogram with the merges of stage 1, stage 5, stage 2 and stage 3 marked]
Slide 30
Icicle plot
14 clusters: Cases 7 and 9 in one cluster, all others each in their own cluster.
13 clusters: Cases 7 and 9 in one cluster, 13 and 14 in one cluster, all others each in their own cluster.
12 clusters: Cases 7 and 9 in one cluster, 11, 13 and 14 in one cluster, all others each in their own cluster.
:
Slide 31
[Figure: plot of proximity ("Coefficients"), vertical axis from 0.0 to 3.0, horizontal axis from 1 to 15]
Slide 32
B) Dendrogram
Choose the number of clusters at the largest increase in heterogeneity (stop merging just before the jump)
Standardized distance
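The rule of stopping before the largest increase in heterogeneity can be expressed as a small helper. This is a rough sketch: the function name `suggest_cluster_count` and the list of coefficients are hypothetical, and a single largest jump is only a heuristic, not a substitute for inspecting the dendrogram.

```python
def suggest_cluster_count(coefficients):
    """Suggest a number of clusters from the coefficients of an
    agglomeration schedule (one coefficient per stage, in stage order)."""
    n_cases = len(coefficients) + 1  # n cases are merged in n - 1 stages
    # Increase of the coefficient from each stage to the next one.
    jumps = [coefficients[s] - coefficients[s - 1]
             for s in range(1, len(coefficients))]
    # Number of stages completed before the merge that causes the largest jump.
    completed_stages = jumps.index(max(jumps)) + 1
    return n_cases - completed_stages

# Hypothetical coefficients for 6 cases (5 stages).
coefficients = [0.2, 0.3, 0.4, 2.5, 2.8]
suggestion = suggest_cluster_count(coefficients)
print(suggestion)
```

In this made-up example the coefficient jumps from 0.4 to 2.5 at stage 4, so merging stops after stage 3 and three clusters remain.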
Slide 33
Slide 34
Range of solutions: 2 to 5
Slide 35
Slide 36
Slide 37
Data of 6 people

           Attributes
Objects    general attitude to life   attitude to innovation   willingness to take risks
Person A   1                          2                        2
Person B   1                          3                        3
Person C   2                          4                        2
Person D   5                          4                        3
Person E   5                          4                        4
Person F   7                          6                        7
Slide 38
            Attributes
Cluster     general attitude to life   attitude to innovation   willingness to take risks
(A, B, C)   1.3                        3                        2.3
(D, E)      5                          4                        3.5
(F)         7                          6                        7
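The cluster values are the attribute means of the cluster members, which can be verified with a short sketch. The scores are the ones read from the table of the six persons (attribute order: general attitude to life, attitude to innovation, willingness to take risks); the names `scores`, `clusters` and `cluster_profile` are hypothetical.

```python
from statistics import mean

# Scores per person: (attitude to life, attitude to innovation, risk taking).
scores = {
    "A": (1, 2, 2),
    "B": (1, 3, 3),
    "C": (2, 4, 2),
    "D": (5, 4, 3),
    "E": (5, 4, 4),
    "F": (7, 6, 7),
}

clusters = [("A", "B", "C"), ("D", "E"), ("F",)]

def cluster_profile(members, data):
    """Mean of each attribute over the members of one cluster, 1 decimal."""
    columns = zip(*(data[m] for m in members))
    return tuple(round(mean(column), 1) for column in columns)

profiles = {members: cluster_profile(members, scores) for members in clusters}
for members, profile in profiles.items():
    print(members, profile)
```

Profiles like these are what is usually compared when the clusters are interpreted and named.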