Malhotra Mr05 PPT 20
Cluster Analysis
6) Clustering Variables
7) Summary
Fig. 20.1: [scatter plot of objects on Variable 1 vs. Variable 2]
Fig. 20.2: [scatter plot of objects on Variable 1 vs. Variable 2]
Case  V1  V2  V3  V4  V5  V6
1     6   4   7   3   2   3
2     2   3   1   4   5   4
3     7   2   6   4   1   3
4     4   6   4   5   3   6
5     1   3   2   2   6   4
6     6   4   6   3   3   4
7     5   3   6   3   3   4
8     7   3   7   4   1   4
9     2   4   3   3   6   3
10    3   5   3   6   4   6
11    1   3   2   3   5   3
12    5   4   5   4   2   4
13    2   2   1   5   4   4
14    4   6   4   6   4   7
15    6   5   4   2   1   4
16    3   5   4   6   4   7
17    4   4   7   2   2   5
18    3   7   2   6   4   3
19    4   6   3   7   2   7
20    2   3   2   4   7
© 2007 Prentice Hall
Conducting Cluster Analysis: Formulate the Problem
Perhaps the most important part of formulating the
clustering problem is selecting the variables on which
the clustering is based.
Inclusion of even one or two irrelevant variables may
distort an otherwise useful clustering solution.
Basically, the set of variables selected should describe
the similarity between objects in terms that are
relevant to the marketing research problem.
The variables should be selected based on past
research, theory, or a consideration of the hypotheses
being tested. In exploratory research, the researcher
should exercise judgment and intuition.
Conducting Cluster Analysis: Select a Distance or Similarity Measure
The most commonly used measure of similarity is the Euclidean
distance or its square. The Euclidean distance is the square root
of the sum of the squared differences in values for each variable.
Other distance measures are also available. The city-block or
Manhattan distance between two objects is the sum of the absolute
differences in values for each variable. The Chebychev distance
between two objects is the maximum absolute difference in values
for any variable.
If the variables are measured in vastly different units, the clustering
solution will be influenced by the units of measurement. In these
cases, before clustering respondents, we must standardize the data
by rescaling each variable to have a mean of zero and a standard
deviation of unity. It is also desirable to eliminate outliers (cases
with atypical values).
Use of different distance measures may lead to different clustering
results. Hence, it is advisable to use different measures and
compare the results.
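As a minimal illustration (Python; not part of the original slides), the three distance measures and z-score standardization can be written directly from their definitions. The sample vectors are cases 1 and 2 of the attitudinal data on V1-V3:

```python
import math

def euclidean(x, y):
    """Square root of the sum of squared differences across variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """City-block distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def chebychev(x, y):
    """Maximum absolute difference on any single variable."""
    return max(abs(a - b) for a, b in zip(x, y))

def standardize(column):
    """Rescale a variable to mean 0 and standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / sd for v in column]

x, y = (6, 4, 7), (2, 3, 1)          # cases 1 and 2, variables V1-V3
print(euclidean(x, y))               # sqrt(16 + 1 + 36) = sqrt(53)
print(manhattan(x, y))               # 4 + 1 + 6 = 11
print(chebychev(x, y))               # max(4, 1, 6) = 6
```

Standardizing each variable first, as recommended above, keeps any one unit of measurement from dominating these sums.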
A Classification of Clustering Procedures
Fig. 20.4: [classification tree of clustering procedures]
[Linkage diagrams: complete linkage (maximum distance), average linkage (average distance), Ward's method]
Conducting Cluster Analysis: Select a Clustering Procedure – Variance Method
The variance methods attempt to generate clusters to
minimize the within-cluster variance.
A commonly used variance method is Ward's procedure.
For each cluster, the means for all the variables are computed.
Then, for each object, the squared Euclidean distance to the
cluster means is calculated (Figure 20.6). These distances are
summed for all the objects. At each stage, the two clusters with
the smallest increase in the overall sum of squares within cluster
distances are combined.
In the centroid methods, the distance between two clusters is
the distance between their centroids (means for all the variables),
as shown in Figure 20.6. Every time objects are grouped, a new
centroid is computed.
Of the hierarchical methods, average linkage and Ward's methods
have been shown to perform better than the other procedures.
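A sketch of Ward's merge criterion (Python; added for illustration, with made-up one-dimensional points): at each stage, the pair of clusters whose merger adds the least to the total within-cluster error sum of squares is combined.

```python
def ess(cluster):
    """Error sum of squares: squared Euclidean distance of each
    object to the cluster mean, summed over objects and variables."""
    n = len(cluster)
    dims = len(cluster[0])
    means = [sum(p[d] for p in cluster) / n for d in range(dims)]
    return sum((p[d] - means[d]) ** 2 for p in cluster for d in range(dims))

def ward_increase(c1, c2):
    """Increase in total within-cluster ESS if c1 and c2 are merged."""
    return ess(c1 + c2) - ess(c1) - ess(c2)

# Three singleton clusters on one hypothetical variable
a, b, c = [(0,)], [(1,)], [(10,)]
pairs = {"a+b": ward_increase(a, b),
         "a+c": ward_increase(a, c),
         "b+c": ward_increase(b, c)}
best = min(pairs, key=pairs.get)
print(pairs)   # a+b: 0.5, a+c: 50.0, b+c: 40.5
print(best)    # a+b is merged first: smallest increase in ESS
```

Repeating this pairwise comparison after every merge reproduces the full agglomerative hierarchy.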
Other Agglomerative Clustering Methods
Fig. 20.6: [diagrams of Ward's procedure and the centroid method]
Case  4 Clusters  3 Clusters  2 Clusters
1     1           1           1
2     2           2           2
3     1           1           1
4     3           3           2
5     2           2           2
6     1           1           1
7     1           1           1
8     1           1           1
9     2           2           2
10    3           3           2
11    2           2           2
12    1           1           1
13    2           2           2
14    3           3           2
15    1           1           1
16    3           3           2
17    1           1           1
18    4           3           2
19    3           3           2
20    2           2           2
Vertical Icicle Plot Using Ward's Method
Fig. 20.7: [vertical icicle plot]
Means of Variables
Cluster No.  V1     V2     V3     V4     V5     V6
3            3.500  5.833  3.333  6.000  3.500  6.000
Conducting Cluster Analysis: Assess Reliability and Validity
1. Perform cluster analysis on the same data using different
distance measures. Compare the results across measures to
determine the stability of the solutions.
2. Use different methods of clustering and compare the results.
3. Split the data randomly into halves. Perform clustering
separately on each half. Compare cluster centroids across
the two subsamples.
4. Delete variables randomly. Perform clustering based on the
reduced set of variables. Compare the results with those
obtained by clustering based on the entire set of variables.
5. In nonhierarchical clustering, the solution may depend on
the order of cases in the data set. Make multiple runs using
different orderings of cases until the solution stabilizes.
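One simple way to quantify agreement between two solutions produced in checks 1-4 is the Rand index: the fraction of object pairs that both partitions treat the same way. A small sketch (Python, not from the slides; the label vectors are hypothetical):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which two clustering solutions
    agree (same cluster in both, or different clusters in both)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical memberships from two runs (e.g. two distance measures)
run1 = [1, 1, 2, 2, 3, 3]
run2 = [3, 3, 1, 1, 2, 2]   # same partition, clusters relabeled
run3 = [1, 1, 1, 2, 2, 2]   # a genuinely different partition
print(rand_index(run1, run2))  # 1.0: identical partitions
print(rand_index(run1, run3))  # below 1.0: partial agreement
```

Because the index compares pairs rather than labels, it is unaffected by arbitrary renumbering of clusters across runs.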
Results of Nonhierarchical Clustering
Table 20.4
Initial Cluster Centers
     Cluster
     1   2   3
V1   4   2   7
V2   6   3   2
V3   3   2   6
V4   7   4   4
V5   2   7   1
V6   7   2   3
Iteration History(a)
           Change in Cluster Centers
Iteration  1      2      3
1          2.154  2.102  2.550
2          0.000  0.000  0.000
a. Convergence achieved due to no or small distance
change. The maximum distance by which any center
has changed is 0.000. The current iteration is 2. The
minimum distance between initial centers is 7.746.
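The assignment/update cycle that produces this kind of iteration history can be sketched with Lloyd's k-means algorithm (Python; the data points and seed centers below are made up for illustration, not taken from the table):

```python
import math

def kmeans(points, centers, max_iter=10, tol=1e-9):
    """Lloyd's algorithm: assign each point to the nearest center,
    recompute centers, and record how far each center moved."""
    history = []
    for _ in range(max_iter):
        # Assignment step: nearest center by squared Euclidean distance
        clusters = [[] for _ in centers]
        for p in points:
            k = min(range(len(centers)),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centers[i])))
            clusters[k].append(p)
        # Update step: each center becomes its cluster's mean
        new_centers = [
            tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        moved = [math.dist(old, new) for old, new in zip(centers, new_centers)]
        history.append(moved)
        centers = new_centers
        if max(moved) < tol:   # convergence: no or small change in centers
            break
    return centers, history

# Hypothetical 1-D data with two obvious groups and two seed centers
pts = [(1,), (2,), (3,), (10,), (11,), (12,)]
centers, hist = kmeans(pts, [(0,), (5,)])
print(centers)   # [(2.0,), (11.0,)]
```

The `hist` list mirrors the "Change in Cluster Centers" columns above: large moves in early iterations, then zeros once the solution has converged.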
Results of Nonhierarchical Clustering
Table 20.4, cont.
Cluster Membership
Case Number  Cluster  Distance
1 3 1.414
2 2 1.323
3 3 2.550
4 1 1.404
5 2 1.848
6 3 1.225
7 3 1.500
8 3 2.121
9 2 1.756
10 1 1.143
11 2 1.041
12 3 1.581
13 2 2.598
14 1 1.404
15 3 2.828
16 1 1.624
17 3 2.598
18 1 3.555
19 1 2.154
20 2 2.102
Results of Nonhierarchical Clustering
Table 20.4, cont.
Final Cluster Centers
Cluster
1 2 3
V1 4 2 6
V2 6 3 4
V3 3 2 6
V4 6 4 3
V5 4 6 2
V6 6 3 4
ANOVA
     Cluster                Error
     Mean Square  df        Mean Square  df        F       Sig.
V1 29.108 2 0.608 17 47.888 0.000
V2 13.546 2 0.630 17 21.505 0.000
V3 31.392 2 0.833 17 37.670 0.000
V4 15.713 2 0.728 17 21.585 0.000
V5 22.537 2 0.816 17 27.614 0.000
V6 12.171 2 1.071 17 11.363 0.001
The F tests should be used only for descriptive purposes because the clusters have been
chosen to maximize the differences among cases in different clusters. The observed
significance levels are not corrected for this, and thus cannot be interpreted as tests of the
hypothesis that the cluster means are equal.
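The F column is simply the cluster mean square divided by the error mean square. A quick check (added here for illustration, using two rows of the table):

```python
# Mean squares (cluster, error) as reported in the ANOVA table
ms = {"V1": (29.108, 0.608), "V6": (12.171, 1.071)}

for var, (cluster_ms, error_ms) in ms.items():
    f = cluster_ms / error_ms
    print(var, round(f, 3))
# Close to the tabled F values of 47.888 and 11.363; small differences
# come from rounding in the reported mean squares.
```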
Number of   Akaike's Information   AIC         Ratio of AIC   Ratio of Distance
Clusters    Criterion (AIC)        Change(a)   Changes(b)     Measures(c)
1 104.140
2 101.171 -2.969 1.000 .847
3 97.594 -3.577 1.205 1.583
4 116.896 19.302 -6.502 2.115
5 138.230 21.335 -7.187 1.222
6 158.586 20.355 -6.857 1.021
7 179.340 20.755 -6.991 1.224
8 201.628 22.288 -7.508 1.006
9 224.055 22.426 -7.555 1.111
10 246.522 22.467 -7.568 1.588
11 269.570 23.048 -7.764 1.001
12 292.718 23.148 -7.798 1.055
13 316.120 23.402 -7.883 1.002
14 339.223 23.103 -7.782 1.044
15 362.650 23.427 -7.892 1.004
a The changes are from the previous number of clusters in the table.
b The ratios of changes are relative to the change for the two cluster solution.
c The ratios of distance measures are based on the current number of clusters
against the previous number of clusters.
Cluster Distribution
Table 20.5, cont.
              N    % of Combined   % of Total
Cluster 1 6 30.0% 30.0%
2 6 30.0% 30.0%
3 8 40.0% 40.0%
Combined 20 100.0% 100.0%
Total 20 100.0%