
A Validity Index Based On Connectivity: 2009 Seventh International Conference On Advances in Pattern Recognition


2009 Seventh International Conference on Advances in Pattern Recognition

A Validity Index Based on Connectivity

Sriparna Saha and Sanghamitra Bandyopadhyay


Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
Email:{sriparna r, sanghami}@isical.ac.in

978-0-7695-3520-3/09 $25.00 © 2009 IEEE
DOI 10.1109/ICAPR.2009.53

Abstract

In this paper we have developed a connectivity based cluster validity index. This validity index is able to detect the number of clusters automatically from data sets having well separated clusters of any shape, size or convexity. The proposed cluster validity index, connect-index, uses the concept of relative neighborhood graph for measuring the amount of “connectedness” of a particular cluster. The proposed connect-index is inspired by the popular Dunn’s index for measuring the cluster validity. Single linkage clustering algorithm is used as the underlying partitioning technique. The superiority of the proposed validity measure in comparison with Dunn’s index is shown for four artificial and two real-life data sets.

1. Introduction

Clustering [1] is a core technique in data-mining with innumerable applications spanning many fields. Model selection in clustering consists of two steps. In the first step the proper clustering method for a particular data set has to be decided upon. Once this choice has been made, one has to determine the number of clusters and also assess the validity of the clusters formed [1]. For this purpose several cluster validity indices have been proposed in the literature. The measure of validity of the clusters should be such that it is able to impose an ordering of the partitionings in terms of their goodness. In other words, if $U_1, U_2, \ldots, U_m$ are the $m$ partitions of $X$, and the corresponding values of a validity measure are $V_1, V_2, \ldots, V_m$, then $V_{k_1} \geq V_{k_2} \geq \ldots \geq V_{k_m}$, $\forall k_i \in \{1, 2, \ldots, m\}$, $i = 1, 2, \ldots, m$, will indicate that $U_{k_1} \uparrow \ldots \uparrow U_{k_m}$. Here ‘$U_i \uparrow U_j$’ indicates that partition $U_i$ is a better clustering than $U_j$. Note that a validity measure may also define a decreasing sequence instead of an increasing sequence of $V_{k_1}, \ldots, V_{k_m}$.

Several cluster validity indices have been proposed in the literature. These include Davies-Bouldin (DB) index [2], Dunn’s index [3], Xie-Beni (XB) index [2], I-index [2], CS-index [4], Sym-index [5], [6] etc., to name just a few. Some of these indices have been found to be able to detect the correct partitioning for a given number of clusters, while some can determine the appropriate number of clusters as well. Maulik and Bandyopadhyay [2] evaluated the performance of four validity indices, namely, the Davies-Bouldin index [2], Dunn’s index [3], Calinski-Harabasz index [2], and a new index I, in conjunction with three different algorithms, viz., the well-known K-means [1], single-linkage algorithm [1] and a SA-based clustering method [2]. But most of the existing cluster validity indices are able to detect the partitioning only where the clusters have either hyperspherical or symmetrical shapes. Dunn’s index [3] is able to detect clusters having different shapes but sometimes it prefers the partitioning where some clusters are merged together to maximize the minimum separation between clusters. In this paper one cluster validity index is developed which is able to detect the appropriate partitioning from data sets having clusters of any shape, size or convexity as long as they are well-separated.

The concept of relative neighborhood graph (RNG) [7] has been successfully applied for solving several pattern recognition problems. One unsupervised clustering technique based on the concepts of RNG is developed in Ref. [8]. In this article the concept of relative neighborhood graph [7] is used to develop a new cluster validity index. The proposed index, connect-index, quantifies the degree of connectivity of individual clusters, while they are well-separated as well. The index is inspired by the popular Dunn’s index [3]. But it outperforms Dunn’s index [3] in determining the number of clusters from data sets having well-separated clusters. Single-linkage clustering technique [1] is used as the underlying partitioning method. The effectiveness of the proposed index in comparison with the popular Dunn’s index [3] is shown for four artificially generated and two real-life data sets.

2. Proposed Cluster Validity Index

In this section the concept of relative neighborhood graph (RNG) [7][8] is first described. This is followed by a detailed description of the cluster validity index proposed here. It is based on the concept of relative neighborhood graph in order to measure the amount of “connectedness” among the clusters.

2.1. Relative Neighborhood Graph

Suppose $r$ is an integer and $p$, $q$ are two points in $r$-dimensional Euclidean space. Then the lune of $p$ and $q$
(denoted $lun(p, q)$ or $lun(pq)$) is the set of points

$$\{z \in \mathbb{R}^r : d(p, z) < d(p, q) \text{ and } d(q, z) < d(p, q)\},$$

where $d$ denotes the Euclidean distance. Alternatively, $lun(p, q)$ denotes the interior of the region formed by the intersection of two $r$-dimensional hyperspheres of radius $d(p, q)$, one of the hyperspheres being centered at $p$ and the other at $q$. This is illustrated in Figure 1 which shows the lune of two points $p$, $q$ in the plane. If $V$ is a set of $n$ points in $r$-space, then define the relative neighborhood graph of $V$ (denoted $RNG(V)$ or simply RNG when $V$ is understood) to be the undirected graph with vertices $V$ such that for each pair $p, q \in V$, $pq$ is an edge of $RNG(V)$ iff $lun(p, q) \cap V = \emptyset$. Here the edge weight of a particular edge $(pq)$ is kept equal to $d(p, q)$, the Euclidean distance between the points $p$ and $q$.

Figure 2(a) shows a set $V$ of points in the plane; Figure 2(b) shows the RNG of this set of points $V$. The RNG problem is: Given a set $V$, find $RNG(V)$.

Figure 1. The lune of two points p and q is the region between the two arcs, not including the boundary.

Figure 2. (a) A set of points in the plane (b) RNG of the points in (a)

2.2. Measuring the Connectivity Among a Set of Points

In order to measure the connectivity among a set of points we have used the above discussed relative neighborhood graph concept. Here the distance between a pair of points is measured in the following way.

1) Construct the relative neighborhood graph of the whole data set.
2) The distance between any two points, $x$ and $y$, denoted as $d_{short}(x, y)$, is measured along the relative neighborhood graph. Find all possible paths between these two points along the RNG. Suppose there are in total $p$ paths between $x$ and $y$, and the number of edges along the $i$th path is $n_i$, for $i = 1, \ldots, p$. If the edges along the $i$th path are denoted as $ed^i_1, \ldots, ed^i_{n_i}$ and the corresponding edge weights are $w(ed^i_1), \ldots, w(ed^i_{n_i})$, then the shortest distance between $x$ and $y$ is defined as follows:

$$d_{short}(x, y) = \min_{i = 1, \ldots, p} \; \max_{j = 1, \ldots, n_i} w(ed^i_j).$$

2.3. Proposed Cluster Validity Index

The proposed cluster validity index is defined as follows. Suppose the clusters formed are denoted by $C_i$, for $i = 1, \ldots, K$, where $K$ is the number of clusters. Then the diameter of a particular cluster is denoted as $diam(C_i)$, for $i = 1, \ldots, K$, which is defined below:

$$diam(C_i) = \max_{x, y \in C_i} \{d_{short}(x, y)\}.$$

Here $d_{short}(x, y)$ is as defined in Section 2.2.

The distance between any two clusters $C_i$ and $C_j$, where $i, j = 1, \ldots, K$, $i \neq j$, is defined as follows:

$$dist(C_i, C_j) = \min_{x \in C_i \text{ and } y \in C_j} \{d_{short}(x, y)\}.$$

Now the proposed connectivity based cluster validity index, connect-index, is defined as follows:

$$connect = \min_{1 \leq i \leq K} \left\{ \min_{1 \leq j \leq K,\, i \neq j} \left\{ \frac{dist(C_i, C_j)}{\max_{1 \leq k \leq K} \{diam(C_k)\}} \right\} \right\}.$$

Intuitively, larger values of connect correspond to good partitioning. Thus the appropriate number of clusters is determined by maximizing connect over different values of $K$. If $connect_i$ denotes the connect-index value for the number of clusters $K = i$, then the appropriate number of clusters, $K^*$, is determined as:

$$K^* = \arg\max_{i = 1, \ldots, K_{max}} connect_i.$$

Here $K_{max}$ is the maximum possible number of clusters. In general, $K_{max}$ is kept equal to $\sqrt{n}$, where $n$ is the number of points in the data set.

connect-index has two components. Its denominator measures the maximum shortest distance between any two points in a particular cluster. If the cluster is completely connected then the shortest distance between any two points would be very small and thus the diameter of that particular cluster would be small too. As connect-index tries to minimize the maximum diameter amongst all clusters, this in turn tries to minimize the diameter of every cluster. Thus when all clusters are well-connected, their diameters are small and the denominator of the connect-index gets a smaller value. The numerator of the connect-index is the minimum separation between any two clusters, which is measured as the minimum shortest distance between any two points belonging to two different clusters along the RNG. In order to increase the value of connect-index, the numerator of this index has to be maximized, thus the minimum separation between any two clusters should be maximum. This only happens if the clusters are well-separated. Thus connect-index gets its maximum value when all the clusters are connected and well-separated as well.

Table 1. Experimental Results on Several Data sets. Here AC denotes the actual number of clusters and OC denotes the obtained number of clusters.

Name        # points   dimension   AC   OC (connect)   OC (Dunn)
Pat1        557        2           3    3              2
Pat2        417        2           2    2              2
Spiral      1000       2           2    2              2
Mixed 5 2   850        2           5    5              6
Iris        150        4           3    2              2
Cancer      683        9           2    2              6

Figure 5. Optimal Partitioning on Pat1 indicated by (a) proposed connect-index for K* = 3 (b) Dunn’s index for K* = 2

Figure 6. Optimal Partitioning indicated by both connect-index and Dunn’s index on (a) Pat2 for K* = 2 (b) Spiral data set for K* = 2
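The definitions of Sections 2.1 to 2.3 can be sketched in code. The following Python sketch is an illustration only and is not the authors' implementation. It assumes a brute-force O(n^3) construction of the RNG and a Floyd-Warshall style update to obtain d_short for all pairs. The function names (rng_weights, minimax_distances, connect_index, best_k) are hypothetical, and the cluster labels are assumed to come from a separate partitioning step such as single linkage.

```python
from math import dist, inf


def rng_weights(points):
    """Edge-weight matrix of the relative neighborhood graph (brute force, O(n^3))."""
    n = len(points)
    d = [[dist(p, q) for q in points] for p in points]
    w = [[0.0 if i == j else inf for j in range(n)] for i in range(n)]
    for p in range(n):
        for q in range(p + 1, n):
            # pq is an edge iff no point z lies strictly inside lun(p, q)
            lune_empty = not any(
                d[p][z] < d[p][q] and d[q][z] < d[p][q] for z in range(n)
            )
            if lune_empty:
                w[p][q] = w[q][p] = d[p][q]
    return w


def minimax_distances(w):
    """d_short for all pairs: min over paths of the largest edge weight on the path."""
    n = len(w)
    ds = [row[:] for row in w]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(ds[i][k], ds[k][j])
                if via_k < ds[i][j]:
                    ds[i][j] = via_k
    return ds


def connect_index(points, labels):
    """Minimum inter-cluster d_short divided by the maximum cluster diameter.

    Assumes at least two clusters and at least one cluster with two distinct
    points (otherwise the maximum diameter is zero).
    """
    ds = minimax_distances(rng_weights(points))
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    groups = list(clusters.values())
    max_diam = max(ds[x][y] for g in groups for x in g for y in g)
    min_dist = min(
        ds[x][y]
        for a in range(len(groups))
        for b in range(a + 1, len(groups))
        for x in groups[a]
        for y in groups[b]
    )
    return min_dist / max_diam


def best_k(points, labelings):
    """Pick the K that maximizes connect; `labelings` maps each candidate K to its labels."""
    return max(labelings, key=lambda k: connect_index(points, labelings[k]))


# Two groups of collinear points separated by a gap of 8 units.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
print(connect_index(points, [0, 0, 0, 1, 1, 1]))  # 8.0
```

Because the denominator is the same for every pair of clusters, the double minimum in the definition reduces to the smallest inter-cluster distance divided by the largest diameter, which is what connect_index computes. In the toy example the RNG is a simple chain, so the gap of 8 units is the separation and every diameter is 1.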


3. Experimental Results

Here the popular single linkage clustering technique [1] is used to partition the data sets used for experiments. Four artificial and two real-life data sets are used to show the efficacy of the proposed cluster validity index, connect-index. The description of the data sets used here for experiment is shown in Table 1. Pat1 and Pat2 data sets are used in Ref. [9], Spiral data set is used in Ref. [10] and Mixed 5 2 data set is used in Ref. [5]. Figures 3(a), 3(b), 4(a), 4(b) show the four artificial data sets, respectively. The two real-life datasets are obtained from (http://www.ics.uci.edu/~mlearn/MLRepository.html). Iris data set represents different categories of irises characterized by four feature values. It has three classes Setosa, Versicolor and Virginica. It is known that two classes (Versicolor and Virginica) have a large amount of overlap while the class Setosa is linearly separable from the other two. The Wisconsin Breast Cancer data set has two categories in it: malignant and benign. The two classes are known to be linearly separable.

Figure 3. (a) Pat1 (b) Pat2

Figure 4. (a) Spiral (b) Mixed 5 2

Single Linkage clustering is used to partition all the above mentioned data sets for $K = 2, \ldots, \sqrt{n}$ and the corresponding connect-index values are computed for all the partitions. Then the partition which corresponds to the maximum value of connect-index is taken as the optimal partitioning and the corresponding number of clusters is regarded as the optimal number of clusters indicated by connect-index. For all the data sets used here for experiment, the optimal number of clusters indicated by connect-index
are reported in Table 1. For the purpose of comparison, the number of clusters identified by the popular Dunn’s index [3] for all data sets used here for experiment is also reported in Table 1. This table reveals that the proposed connect-index is able to identify the appropriate number of clusters from five of the six data sets used here for experiment, while Dunn’s index is able to detect the appropriate number of clusters from two out of these six data sets. For Iris data set, both the validity indices provide K* = 2, which is also often obtained by many other methods for Iris. Figures 5(a), 6(a), 6(b), 7(a) show, respectively, the optimal partitionings indicated by connect-index for the four artificial data sets used here for experiment. Similarly Figures 5(b), 6(a), 6(b) and 7(b) show, respectively, the optimal partitionings indicated by the popular Dunn’s index for these four artificial data sets.

Figure 7. Optimal Partitioning for Mixed 5 2 indicated by (a) proposed connect-index for K* = 5 (b) Dunn’s index for K* = 6

For the two real-life data sets, Iris and Cancer, no visualization is possible as these are high-dimensional data sets. The Minkowski Score (MS) [11] is calculated after application of Single Linkage clustering technique for these two real-life data sets. This is a measure of the quality of a solution given the true clustering. Let T be the “true” solution and S the solution we wish to measure. Denote by $n_{11}$ the number of pairs of elements that are in the same cluster in both S and T. Denote by $n_{01}$ the number of pairs that are in the same cluster only in S, and by $n_{10}$ the number of pairs that are in the same cluster only in T. Minkowski Score (MS) is then defined as:

$$MS(T, S) = \sqrt{\frac{n_{01} + n_{10}}{n_{11} + n_{10}}}.$$

For MS, the optimum score is 0, with lower scores being “better”. For Iris data set, the MS value corresponding to the partitioning obtained by Single Linkage clustering for K = 2 is 0.88. Again for Cancer data set, Single Linkage clustering technique obtains an MS of 0.43 for K = 2 (number of partitions indicated by newly proposed connect-index) while that for K = 6 (number of partitions indicated by Dunn’s index) is 1.45.

4. Discussion and Conclusion

Identifying the proper number of clusters and the proper partitioning from a data set are two crucial issues in unsupervised classification. In this paper one cluster validity index is developed for this purpose. The proposed index is able to detect the appropriate number of clusters and the appropriate partitioning from data sets as long as the clusters are well separated, whatever their shape, size or convexity. The effectiveness of the proposed index in comparison with one existing cluster validity index, Dunn’s index, is shown for four artificial and two real-life data sets. Future work includes developing some mathematical proof of the proposed index. Comparing the proposed validity index with other existing indices more extensively is another important direction for future research.

References

[1] B. S. Everitt, S. Landau, and M. Leese, Cluster Analysis. London: Arnold, 2001.

[2] U. Maulik and S. Bandyopadhyay, “Performance evaluation of some clustering algorithms and validity indices,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650–1654, 2002.

[3] J. C. Dunn, “A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters,” Journal of Cybernetics, vol. 3, pp. 32–57, 1973.

[4] C. H. Chou, M. C. Su, and E. Lai, “A new cluster validity measure and its application to image compression,” Pattern Analysis and Applications, vol. 7, pp. 205–220, 2004.

[5] S. Bandyopadhyay and S. Saha, “A point symmetry based clustering technique for automatic evolution of clusters,” IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 11, pp. 1–17, November, 2008.

[6] S. Saha and S. Bandyopadhyay, “Application of a new symmetry based cluster validity index for satellite image segmentation,” IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, pp. 166–170, 2008.

[7] G. T. Toussaint, “The relative neighborhood graph of a finite planar set,” Pattern Recognition, vol. 12, pp. 261–268, 1980.

[8] S. Bandyopadhyay, “An automatic shape independent clustering technique,” Pattern Recognition, vol. 37, pp. 33–45, 2004.

[9] S. K. Pal, S. Bandyopadhyay, and C. A. Murthy, “Genetic algorithms for generation of class boundaries,” IEEE Trans. System Man Cybernet, vol. 28, no. 6, pp. 816–828, 1998.

[10] J. Handl and J. Knowles, “An evolutionary approach to multiobjective clustering,” IEEE Transactions on Evolutionary Computation, vol. 11, no. 1, pp. 56–76, 2007.

[11] A. Ben-Hur and I. Guyon, Detecting Stable Clusters using Principal Component Analysis in Methods in Molecular Biology. Humana press, 2003.
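The Minkowski Score used in Section 3 for the two real-life data sets can be computed directly from pair counts. The Python sketch below is an illustration, not the authors' code. It assumes the usual square-root form of the score, MS(T, S) = sqrt((n01 + n10) / (n11 + n10)), and the function name minkowski_score is hypothetical.

```python
from itertools import combinations
from math import sqrt


def minkowski_score(true_labels, pred_labels):
    """Minkowski Score of solution S (pred_labels) against the true solution T.

    Assumes T puts at least one pair of elements in the same cluster,
    otherwise the denominator n11 + n10 is zero.
    """
    n11 = n01 = n10 = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_t = true_labels[i] == true_labels[j]
        same_s = pred_labels[i] == pred_labels[j]
        if same_t and same_s:
            n11 += 1  # pair together in both S and T
        elif same_s:
            n01 += 1  # pair together only in S
        elif same_t:
            n10 += 1  # pair together only in T
    return sqrt((n01 + n10) / (n11 + n10))


# A perfect match scores 0; merging everything into one cluster is penalized.
print(minkowski_score([0, 0, 1, 1], [0, 0, 1, 1]))  # 0.0
```

The pairwise loop is quadratic in the number of points, which is adequate for data sets of the size reported in Table 1.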
