A Validity Index Based On Connectivity: 2009 Seventh International Conference On Advances in Pattern Recognition
its maximum value when all the clusters are connected and

Figure 3. (a) Pat1 (b) Pat2

Figure 4. (a) Spiral (b) Mixed 5 2

3. Experimental Results

Here the popular single linkage clustering technique [1] is used to partition
the data sets used for experiments. Four artificial and two real-life data sets
are used to show the efficacy of the proposed cluster validity index,
connect-index. The description of the data sets used here is shown in Table 1.
The Pat1 and Pat2 data sets are used in Ref. [9], the Spiral data set is used in
Ref. [10], and the Mixed 5 2 data set is used in Ref. [5]. Figures 3(a), 3(b),
4(a) and 4(b) show the four artificial data sets, respectively. The two
real-life data sets are obtained from
http://www.ics.uci.edu/~mlearn/MLRepository.html. The Iris data set represents
different categories of irises characterized by four feature values. It has
three classes: Setosa, Versicolor and Virginica. It is known that two classes
(Versicolor and Virginica) have a large amount of overlap, while the class
Setosa is linearly separable from the other two. The Wisconsin Breast Cancer
data set has two categories: malignant and benign. The two classes are known to
be linearly separable.

Table 1. Experimental Results on Several Data sets.
Spiral 2 2 2
Mixed 5 2 2 5 5 6

Single Linkage clustering is used to partition all the above-mentioned data sets
for K = 2, ..., √n, and the corresponding connect-index values are computed for
all the partitions. The partition that corresponds to the maximum value of
connect-index is then taken as the optimal partitioning, and the corresponding
number of clusters is regarded as the optimal number of clusters indicated by
connect-index. For all the data sets used here for experiment,
the optimal numbers of clusters indicated by connect-index
are reported in Table 1. For the purpose of comparison, the numbers of clusters
identified by the popular Dunn’s index [3] for all data sets used here are also
reported in Table 1. This table reveals that the proposed connect-index is able
to identify the appropriate number of clusters from almost all the data sets
used here for experiment, while Dunn’s index is able to detect the appropriate
number of clusters from only two of these six data sets. For the Iris data set,
both validity indices provide K* = 2, a value that is also often obtained by
other methods for Iris. Figures 5(a), 6(a), 6(b), 7(a) show, respectively, the
optimal partitionings indicated by connect-index for the four artificial data
sets used here for experiment. Similarly, Figures 5(b), 6(a), 6(b) and 7(b)
show, respectively, the optimal partitionings indicated by the popular Dunn’s
index for these four artificial data sets.

For the two real-life data sets, Iris and Cancer, no visualization is possible
as these are high-dimensional data sets. The Minkowski Score (MS) [11] is
calculated after application of the Single Linkage clustering technique to these
two real-life data sets. This is a measure of the quality of a solution given
the true clustering. Let T be the “true” solution and S the solution we wish to
measure. Denote by $n_{11}$ the number of pairs of elements that are in the same
cluster in both S and T. Denote by $n_{01}$ the number of pairs that are in the
same cluster only in S, and by $n_{10}$ the number of pairs that are in the same
cluster only in T. The Minkowski Score (MS) is then defined as:

$$MS(T, S) = \sqrt{\frac{n_{01} + n_{10}}{n_{11} + n_{10}}}$$

For MS, the optimum score is 0, with lower scores being “better”. For the Iris
data set, the MS value corresponding to the partitioning obtained by Single
Linkage clustering for K = 2 is 0.88. For the Cancer data set, the Single
Linkage clustering technique obtains an MS of 0.43 for K = 2 (number of
partitions indicated by the newly proposed connect-index), while that for K = 6
(number of partitions indicated by Dunn’s index) is 1.45.

4. Discussion and Conclusion

Identifying the proper number of clusters and the proper partitioning from a
data set are two crucial issues in unsupervised classification. In this paper
one cluster validity index is developed for this purpose. The proposed index is
able to detect the appropriate number of clusters and the appropriate
partitioning from data sets as long as the

References

[1] B. S. Everitt, S. Landau, and M. Leese, Cluster Analysis. London: Arnold, 2001.

[2] U. Maulik and S. Bandyopadhyay, “Performance evaluation of some clustering algorithms and validity indices,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650–1654, 2002.

[3] J. C. Dunn, “A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters,” Journal of Cybernetics, vol. 3, pp. 32–57, 1973.

[4] C. H. Chou, M. C. Su, and E. Lai, “A new cluster validity measure and its application to image compression,” Pattern Analysis and Applications, vol. 7, pp. 205–220, 2004.

[5] S. Bandyopadhyay and S. Saha, “A point symmetry based clustering technique for automatic evolution of clusters,” IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 11, pp. 1–17, November, 2008.

[6] S. Saha and S. Bandyopadhyay, “Application of a new symmetry based cluster validity index for satellite image segmentation,” IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, pp. 166–170, 2008.

[7] G. T. Toussaint, “The relative neighborhood graph of a finite planar set,” Pattern Recognition, vol. 12, pp. 261–268, 1980.

[8] S. Bandyopadhyay, “An automatic shape independent clustering technique,” Pattern Recognition, vol. 37, pp. 33–45, 2004.

[9] S. K. Pal, S. Bandyopadhyay, and C. A. Murthy, “Genetic algorithms for generation of class boundaries,” IEEE Trans. System Man Cybernet, vol. 28, no. 6, pp. 816–828, 1998.

[10] J. Handl and J. Knowles, “An evolutionary approach to multiobjective clustering,” IEEE Transactions on Evolutionary Computation, vol. 11, no. 1, pp. 56–76, 2007.

[11] A. Ben-Hur and I. Guyon, Detecting Stable Clusters using Principal Component Analysis in Methods in Molecular Biology. Humana Press, 2003.
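
The experimental protocol of Section 3 can be illustrated with a minimal Python sketch, given under several assumptions that are not taken from the paper. The function names (`single_linkage`, `dunn_index`, `select_k`, `minkowski_score`) are hypothetical, Euclidean distance is assumed, and the upper limit √n is taken as floor(√n). Because the definition of connect-index is not restated in this part of the text, Dunn's index is used as a stand-in validity index; any function with the same signature could be passed to `select_k` instead. The Minkowski Score follows the pair-counting definition given above and assumes that at least one pair of points shares a cluster in the true solution T.

```python
from itertools import combinations
from math import dist, isqrt, sqrt


def single_linkage(points, k):
    """Single-linkage partition into k clusters.

    Merging the two closest clusters repeatedly is equivalent to adding
    point-to-point edges in increasing order of length (Kruskal style)
    and stopping once k connected components remain.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(combinations(range(len(points)), 2),
                   key=lambda e: dist(points[e[0]], points[e[1]]))
    clusters = len(points)
    for i, j in edges:
        if clusters == k:
            break
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            clusters -= 1
    return [find(i) for i in range(len(points))]


def dunn_index(points, labels):
    """Dunn's index (stand-in for connect-index, which is not defined here):
    smallest between-cluster distance divided by largest cluster diameter."""
    min_between, max_diameter = float("inf"), 0.0
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        if labels[i] == labels[j]:
            max_diameter = max(max_diameter, d)
        else:
            min_between = min(min_between, d)
    return min_between / max_diameter if max_diameter > 0 else float("inf")


def select_k(points, index=dunn_index):
    """Return (K*, labels) maximising `index` over K = 2, ..., floor(sqrt(n))."""
    if len(points) < 4:
        raise ValueError("need at least 4 points so that floor(sqrt(n)) >= 2")
    best = None
    for k in range(2, isqrt(len(points)) + 1):
        labels = single_linkage(points, k)
        score = index(points, labels)
        if best is None or score > best[0]:
            best = (score, k, labels)
    return best[1], best[2]


def minkowski_score(true_labels, pred_labels):
    """MS(T, S) = sqrt((n01 + n10) / (n11 + n10)), counted over point pairs."""
    n11 = n01 = n10 = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_t = true_labels[i] == true_labels[j]
        same_s = pred_labels[i] == pred_labels[j]
        n11 += same_t and same_s        # together in both T and S
        n01 += same_s and not same_t    # together only in S
        n10 += same_t and not same_s    # together only in T
    return sqrt((n01 + n10) / (n11 + n10))
```

Calling `select_k(points)` on a list of coordinate tuples returns the selected number of clusters together with one label per point, and `minkowski_score(true_labels, labels)` then scores that partition against a known ground truth. The pairwise loops are quadratic in the number of points, which is acceptable for data sets of the sizes discussed here but would need a more efficient implementation for large data.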