Determining The Number of Groups From Measures of Cluster Stability
1 Introduction
A major challenge in cluster analysis is the validation of clusters resulting
from cluster analysis algorithms. One relevant approach involves defining
an index measuring the adequacy of a cluster structure to the data set and
establishing how likely a given value of the index is under some null model
formalizing ‘no cluster structure’, e.g., [Bailey and Dubes, 1982], [Jain and
Dubes, 1988], [Gordon, 1994], [Milligan, 1996] and [Gordon, 1999]. Another
type of approach is concerned with the estimation of the stability of clustering
results. Informally speaking, cluster stability holds when membership of
the clusters is not affected by small changes in the data set [Cheng and
Milligan, 1996]. Several recent approaches, see for example [Tibshirani et al.,
2001], [Levine and Domany, 2001], [Ben-Hur et al., 2002] and [Bertrand and
Bel Mufti, 2005], suggest that cluster stability is a valuable way to determine
the number of clusters of any partitioning of the data. Such a stability-based approach aims to identify the values of the number of clusters (or of any other parameter of the clustering method) for which local maxima of stability are reached.
The main contribution of this paper is to compare this stability-based approach with two classical methods that are among the most successful at predicting
$$ t(A, X') \;=\; 1 - \frac{n'\,(n'-1)\; m(X'; A, \bar{A})}{2\, n'_A\, (n' - n'_A)\; m(X')}, \qquad (1) $$
where n' denotes the number of objects in the sample X', n'_A the number of sampled objects belonging to cluster A, m(X') the number of pairs of objects that are clustered together by P_k(X'), and m(X'; A, Ā) the number of pairs of sampled objects that are in the same cluster of P_k(X') and for which exactly one of the two objects belongs to A. Taking into account only the criterion of cluster isolation, the stability measure of cluster A is defined simply as the average, denoted here by t_N(A), of the values t(A, X'_i) obtained for a large number N of samples X'_i (i = 1, ..., N):

$$ t_N(A) \;=\; \frac{1}{N} \sum_{i=1}^{N} t(A, X'_i). \qquad (2) $$
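To make the pair-counting behind Eq. (1) concrete, here is a minimal Python sketch; the names (`isolation_stability`, `labels`, `in_A`) are illustrative, and the partition P_k(X') of each sample is assumed to be available as a list of cluster labels.

```python
# Illustrative sketch of the isolation-based stability t(A, X') of Eq. (1).
# `labels[i]` is the cluster of sampled object i under P_k(X');
# `in_A[i]` is True if sampled object i belongs to cluster A.
# All names are assumptions for illustration, not the authors' code.

def isolation_stability(labels, in_A):
    n = len(labels)                    # n': number of sampled objects
    n_A = sum(in_A)                    # n'_A: sampled objects in A
    m = 0                              # m(X'): pairs clustered together
    m_split = 0                        # m(X'; A, A-bar): such pairs with
                                       # exactly one member in A
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                m += 1
                if in_A[i] != in_A[j]:
                    m_split += 1
    return 1 - (n * (n - 1) * m_split) / (2 * n_A * (n - n_A) * m)
```

Averaging this value over the N samples X'_i then gives t_N(A) of Eq. (2).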
The probability significance of each stability value is estimated under a null hypothesis H0, by means of the procedure presented by Jain and Dubes [Jain and Dubes, 1988] (see also [Gordon, 1994]): it seems reasonable to specify the absence of cluster stability by the absence of clustering structure. For example, the value t_N(A) = 0.899 is an indication of high stability if and only if its estimated probability significance (p-value) under H0 is less than 5%.
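The testing step can be sketched as a plain Monte Carlo comparison; how the null stabilities are produced (re-running the clustering and stability computation on data simulated under H0) is assumed here, and `mc_p_value` is an illustrative name.

```python
# Hedged sketch of a Monte Carlo significance estimate: the observed
# stability is compared with stabilities computed on data simulated
# under H0 (no cluster structure). `null_stabilities` would come from
# re-running the whole clustering/stability pipeline on null data.

def mc_p_value(observed, null_stabilities):
    # Fraction of null replicates at least as stable as the observed
    # value, with the usual add-one correction.
    hits = sum(1 for s in null_stabilities if s >= observed)
    return (hits + 1) / (len(null_stabilities) + 1)
```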
In addition, many indices that measure the adequacy between a partition and the data set have been proposed for determining the number of clusters. According to the survey of Milligan and Cooper [Milligan and Cooper, 1985], the index of Calinski and Harabasz [Calinski and Harabasz, 1974] and the index of Krzanowski and Lai [Krzanowski and Lai, 1985] are among the best-performing indices (see also [Tibshirani et al., 2001]).
408 Bel Mufti et al.
$$ KL(k) = \left| \frac{\mathrm{DIFF}(k)}{\mathrm{DIFF}(k+1)} \right|, \qquad (4) $$

where

$$ \mathrm{DIFF}(k) = (k-1)^{2/p}\, W(k-1) - k^{2/p}\, W(k), \qquad (5) $$

W(k) denotes the within-cluster sum of squares of the partition into k clusters, and p denotes the number of features in the data set. A value of k is optimal if it maximizes KL(k).
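Assuming the within-cluster sums of squares W(k) are available (e.g., from K-means runs for each k), both indices take only a few lines of Python. The CH formula used below is the standard Calinski-Harabasz definition, assumed here since its statement does not appear in this excerpt; n is the number of objects and T the total sum of squares.

```python
# Illustrative computation of the CH and KL indices from precomputed
# within-cluster sums of squares W (a dict mapping k to W(k)).
# The CH definition is the standard one, assumed here because the
# excerpt omits it; all names are for illustration only.

def ch_index(W, k, n, T):
    # Between-group variance over within-group variance.
    return ((T - W[k]) / (k - 1)) / (W[k] / (n - k))

def diff(W, k, p):
    # DIFF(k) of Eq. (5); p is the number of features.
    return (k - 1) ** (2 / p) * W[k - 1] - k ** (2 / p) * W[k]

def kl_index(W, k, p):
    # KL(k) of Eq. (4).
    return abs(diff(W, k, p) / diff(W, k + 1, p))
```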
The rest of the section is devoted to the comparison of the performance
of the three indices BB, CH and KL on the basis of results obtained for two
data sets: an artificial data set and the well known Iris data set.
We consider the artificial data set that is represented in Figure 1. This data
set is a 200 point sample of a mixture of four normal distributions.
[Figure 1. Scatter plot of the artificial data set; the legend distinguishes clusters 1 to 4.]
Each cluster is indeed a 50-point sample from one of the four normal distributions, and, except for one point, the four clusters are easily identified by looking at Figure 1. The four normal distributions are centered at µ1 = (−1.5, −.5), µ2 = (3, 2), µ3 = (0, 4) and µ4 = (4.5, 0), respectively, and share the same variance-covariance matrix V = .5I, where I denotes the identity matrix.
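This sampling scheme can be reproduced with a short sketch (the seed and the point ordering are arbitrary choices, not taken from the paper):

```python
import random

# Reproduce the artificial data set of Figure 1: 50 points from each of
# four spherical normal distributions with covariance 0.5*I, so each
# coordinate has standard deviation sqrt(0.5). The seed is arbitrary.

def sample_mixture(seed=0):
    rng = random.Random(seed)
    centers = [(-1.5, -0.5), (3.0, 2.0), (0.0, 4.0), (4.5, 0.0)]
    sd = 0.5 ** 0.5
    points, labels = [], []
    for c, (mx, my) in enumerate(centers):
        for _ in range(50):
            points.append((rng.gauss(mx, sd), rng.gauss(my, sd)))
            labels.append(c)
    return points, labels
```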
This data set was partitioned using the batch K-means method, and the stability measures were computed with sampling ratio f = 0.8. The values of the three indices are given in Table 1 for k ∈ {2, 3, 4, 5, 6}. The probability significances under H0 (p-values) suggest that the 4-partition is the most significant.
Index    2     3     4      5     6
CH(k)    145   414   580∗   494   446
KL(k)    .26   3.36  3.89   1.39  5.95∗

Table 1. Values of the three indices for partitions of the artificial data. According to each index (row), a symbol (∗) indicates the optimal number of clusters.
Table 3 presents the stability measures of the 5-partition. Note that, with a p-value less than 4.5% at a 97.5%-approximate coverage probability, the global validity of the partition into 5 clusters can be deemed significant. Each of these stability measures was computed with a precision of at least 0.02, and N = 1500 samples were necessary to obtain this precision.
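The link between the number of samples N and the attained precision can be sketched with a normal approximation; the confidence level and the stability standard deviation used below are illustrative assumptions, not values from the paper.

```python
import math

# Illustrative sample-size calculation: under a normal approximation,
# a confidence interval for the mean stability has half-width
# z * s / sqrt(N), so reaching a target half-width ("precision")
# requires N >= (z * s / half_width)**2. Inputs are assumed values.

def samples_for_precision(s, half_width, z=1.96):
    return math.ceil((z * s / half_width) ** 2)
```

For instance, a standard deviation near 0.4 would call for on the order of 1500 samples at precision 0.02, consistent with the order of magnitude reported above.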
Table 2. Stability measures for the 4-partition (prec. 0.01), and their p-values (%).

It turns out that clusters 1, 2 and 3 (which coincide with clusters 3, 4 and 2, respectively, in Figure 1) are clearly stable for all cluster characteristics
except the cohesion of cluster 3. Clusters 4 and 5 (obtained by splitting cluster 1 of Figure 1 into two clusters) have low stability values (.716 and .777, respectively) and high p-values (in the intervals 34–50% and 22–39%). Therefore, their existence is clearly dubious. Stability measures for partial isolation between clusters were also computed: the extremely weak stability measure for partial isolation between clusters 4 and 5 (−.999) suggests that the split represents a dissection rather than a real cluster structure involving separate and homogeneous clusters.
Table 3. Stability measures (prec. 0.01) of the 5-partition, and their p-values (%).
Index                         2        3         4      5
CH(k)                         756      1211      1266   1358∗
KL(k)                         4.83     6.01∗     1.3    1.12
BB(k)                         .992∗    .959      .881   .900
Prob. signif. of BB(k) (%)    .3–3.4   6.7–11.9  > 34   5.2–9.4

Table 4. Values of the indices on Iris data partitions. According to each index (row), a symbol (∗) indicates an optimal number of clusters.
Table 4 shows the values of the three indices used for choosing the optimal number of clusters on the Iris data. The 2-partition, with a p-value between .3 and 3.4%, is the most stable partition according to the index BB, followed by the 5-partition and the 3-partition, with p-values in the intervals 5.2–9.4% and 6.7–11.9%, respectively. Although the p-values of the last two partitions do not differ significantly, the large p-values of the stability measures of two clusters of the 5-partition (in the intervals 39–53% and 52–65%) raise doubts about the validity of this partition (see also [Bertrand and Bel Mufti, 2005]). The stability measure BB is the only one to identify the trivial partition into two clusters, whereas the KL index identifies the 3-partition as optimal. By choosing the 5-partition, the CH index is the worst performer on the Iris data set.
4 Conclusion
The results presented in this paper confirm that measuring cluster stability can be a valuable approach to determining the 'correct' number of clusters of a partition. A real advantage of this general approach is that it does not require selecting or using any measure of adequacy between the data set and the partition examined.
Note that the p-values used to assess the measures of cluster stability may be decisive when estimating the stability of clusters. For example, the p-values of Table 1 show that the stability value .915, which assesses the stability of the 5-partition, is statistically more significant under the null hypothesis of absence of structure than the stability value .958, which assesses the stability of the 3-partition. In addition, an advantage of the stability-based approach proposed in [Bertrand and Bel Mufti, 2005] is that a careful interpretation of the p-values of the stability measures enables one to identify not only a pertinent partition but also several sources of variation in partitional stability, such as individual cluster isolation and cohesion.
References
[Bailey and Dubes, 1982]T. A. Bailey and R. Dubes. Cluster validity profiles. Pattern Recognition 15, 61–83, 1982.
[Ben-Hur et al., 2002]A. Ben-Hur, A. Elisseeff and I. Guyon. A stability based
method for discovering structure in clustered data. Pacific Symposium on
Biocomputing 7, 6–17, 2002.
[Bertrand and Bel Mufti, 2005]P. Bertrand and G. Bel Mufti. Loevinger’s measures
of rule quality for assessing cluster stability. Computational Statistics and Data
Analysis, 2005, to appear.
[Calinski and Harabasz, 1974]T. Calinski and J. Harabasz. A dendrite method for cluster analysis. Communications in Statistics 3, 1–27, 1974.
[Cheng and Milligan, 1996]R. Cheng and G. W. Milligan. Measuring the influence
of individual data points in a cluster analysis. J. Classification 13, 315–335,
1996.
[Gordon, 1994]A. D. Gordon. Identifying genuine clusters in a classification. Com-
putational Statistics and Data Analysis 18, 561–581, 1994.
[Gordon, 1999]A. D. Gordon. Classification. Chapman & Hall, 1999.
[Jain and Dubes, 1988]A. K. Jain and R. Dubes. Algorithms for clustering data.
Prentice-Hall, Englewood Cliffs, NJ, 1988.
[Krzanowski and Lai, 1985]W. J. Krzanowski and Y. T. Lai. A criterion for determining the number of groups in a data set using sum-of-squares clustering. Biometrics 44, 23–34, 1985.
[Lenca et al., 2003]P. Lenca, P. Meyer, B. Vaillant and S. Lallich. Critères
d’évaluation des mesures de qualité en ECD. Revue des Nouvelles Technologies
de l’Information (Entreposage et Fouille de données), 1, 123–134, 2003.
[Levine and Domany, 2001]E. Levine and E. Domany. Resampling method for un-
supervised estimation of cluster validity. Neural Comput. 13, 2573–2593, 2001.
[Loevinger, 1947]J. Loevinger. A systematic approach to the construction and evaluation of tests of ability. Psychological Monographs, 61 (4), 1947.
[Milligan and Cooper, 1985]G. W. Milligan and M. C. Cooper. An examination of
procedures for determining the number of clusters in a data set. Psychometrika
50, 159–179, 1985.
[Milligan, 1996]G. W. Milligan. Clustering validation: results and implications for applied analyses. In P. Arabie, L. J. Hubert and G. De Soete, editors, Clustering and Classification. World Scientific Publ., River Edge, NJ, pp. 341–375, 1996.
[Tibshirani et al., 2001]R. Tibshirani, G. Walther, D. Botstein and P. Brown. Clus-
ter validation by prediction strength. Stanford Technical Report, Department
of Statistics, Stanford University, USA, 2001.