A Validity Index Based On Connectivity: 2009 Seventh International Conference On Advances in Pattern Recognition
Table 1. Experimental Results on Several Data sets.
Spiral 2 2 2
Mixed 5 2 2 5 5 6

its maximum value when all the clusters are connected and
Figure 3. (a) Pat1 (b) Pat2

Figure 4. (a) Spiral (b) Mixed 5 2

3. Experimental Results

Here the popular single linkage clustering technique [1] is used to partition the data sets used for experiments. Four artificial and two real-life data sets are used to show the efficacy of the proposed cluster validity index, connect-index. The data sets used in the experiments are described in Table 1. The Pat1 and Pat2 data sets are used in Ref. [9], the Spiral data set is used in Ref. [10] and the Mixed 5 2 data set is used in Ref. [5]. Figures 3(a), 3(b), 4(a) and 4(b) show the four artificial data sets, respectively. The two real-life data sets are obtained from (http://www.ics.uci.edu/~mlearn/MLRepository.html). The Iris data set represents different categories of irises characterized by four feature values. It has three classes: Setosa, Versicolor and Virginica. It is known that two classes (Versicolor and Virginica) have a large amount of overlap while the class Setosa is linearly separable from the other two. The Wisconsin Breast Cancer data set has two categories: malignant and benign. The two classes are known to be linearly separable.

Single linkage clustering is used to partition all the above mentioned data sets for K = 2, ..., √n, and the corresponding connect-index values are computed for all the partitions. The partition which corresponds to the maximum value of connect-index is then taken as the optimal partitioning, and the corresponding number of clusters is regarded as the optimal number of clusters indicated by connect-index. For all the data sets used in the experiments, the optimal numbers of clusters indicated by connect-index are reported in Table 1. For the purpose of comparison,

4. Discussion and Conclusion

Identifying the proper number of clusters and the proper partitioning from a data set are two crucial issues in unsupervised classification. In this paper, a cluster validity index is developed for this purpose. The proposed index is able to detect the appropriate number of clusters and the appropriate partitioning from data sets as long as the
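The selection procedure described in Section 3 (partition with single linkage for K = 2, ..., √n, score every partition, keep the one with the maximum index value) can be illustrated with a minimal Python sketch. The definition of connect-index is not reproduced in this section, so the sketch substitutes the Dunn index purely as a stand-in scoring function. The function names `single_linkage`, `dunn_index` and `select_num_clusters` are hypothetical, and taking the integer floor of √n as the upper limit on K is an assumption.

```python
import math
from itertools import combinations


def single_linkage(points, k):
    """Return cluster labels from cutting the single-linkage hierarchy at k clusters."""
    n = len(points)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Single linkage always merges the two clusters joined by the shortest edge.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    clusters = n
    for _, i, j in edges:
        if clusters <= k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            clusters -= 1

    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


def dunn_index(points, labels):
    """Stand-in validity index (NOT connect-index): min separation / max diameter."""
    min_gap, max_diam = math.inf, 0.0
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        if labels[i] == labels[j]:
            max_diam = max(max_diam, d)
        else:
            min_gap = min(min_gap, d)
    return min_gap / max_diam if max_diam > 0 else math.inf


def select_num_clusters(points, index=dunn_index):
    """Score partitions for K = 2..floor(sqrt(n)) and return the best K and all scores."""
    k_max = max(2, math.isqrt(len(points)))  # assumption: floor of sqrt(n)
    scores = {}
    for k in range(2, k_max + 1):
        labels = single_linkage(points, k)
        scores[k] = index(points, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

Any other validity index can be plugged in through the `index` argument, provided it takes the points and their labels and returns a score for which larger values indicate a better partition.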
