
2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA)

Cluster Quality Analysis Using Silhouette Score

Ketan Rajshekhar Shahapure
Department of Computer Science and Electrical Engineering
University of Maryland, Baltimore County
Email: ketans1@umbc.edu

Charles Nicholas
Department of Computer Science and Electrical Engineering
University of Maryland, Baltimore County
Email: nicholas@umbc.edu

Abstract—Clustering is an important phase in data mining. Selecting the number of clusters in a clustering algorithm, e.g. choosing the best value of k in the various k-means algorithms [1], can be difficult. We studied the use of silhouette scores and scatter plots to suggest, and then validate, the number of clusters we specified in running the k-means clustering algorithm on two publicly available data sets. Scikit-learn's [4] silhouette score method, which is a measure of the quality of a cluster, was used to find the mean silhouette coefficient of all the samples for different numbers of clusters. The highest silhouette score indicates the optimal number of clusters. We present several instances of utilizing the silhouette score to determine the best value of k for those data sets.
1. Introduction

Determining the optimal number of clusters for a data set is an important problem in certain clustering algorithms, especially the well-known k-means and similar algorithms [1]. There is no one-size-fits-all method to determine the value of k; the optimal value for a given data set may well depend on the methods used for measuring similarities and on the initial seed values used for partitioning. One solution is to inspect the dendrogram resulting from hierarchical clustering, but this remains a somewhat subjective and expensive approach, since hierarchical clustering is intrinsically slower than k-means. Hierarchical clustering could still be applied to several small subsets of the data to find a reasonable estimate of k. We choose the more direct method of analyzing silhouette scores [5], which measure the quality of clusters. A high average silhouette coefficient value indicates good clustering and helps in deciding the optimal number of clusters k [3]. We present examples of this approach, along with 2-d and 3-d scatter plots, to support, if not validate, the results.

We propose to investigate whether the silhouette score can be used to validate the number of clusters obtained by running the k-means clustering algorithm on each of several data sets. Dimensionality reduction is done to reduce the number of features and generate a 2D or 3D scatter plot, which helps in visually analyzing the number of clusters and validating the result. The following steps are carried out for the analysis (a code sketch follows the list):

1) Obtain the data in the form of tuples.
2) Run sklearn's k-means algorithm on the data set.
3) Obtain cluster labels by fitting the data.
4) Calculate the mean silhouette coefficient by passing the tuples and cluster labels.
5) Repeat this for different numbers of clusters (i.e. values of k).
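The sketch below illustrates these five steps with scikit-learn; the data set, the k range of 2 to 10, and the variable names are illustrative assumptions rather than the authors' exact code.

# Minimal sketch of the five steps above (illustrative; the Iris data stands in
# for any data set supplied as an (n_samples, n_features) array of tuples).
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)              # step 1: obtain the data
scores = {}
for k in range(2, 11):                         # step 5: repeat for different values of k
    km = KMeans(n_clusters=k, random_state=0, n_init=10)   # step 2: k-means
    labels = km.fit_predict(X)                 # step 3: cluster labels from fitting the data
    scores[k] = silhouette_score(X, labels)    # step 4: mean silhouette coefficient
best_k = max(scores, key=scores.get)           # the highest mean score suggests k
print(scores, best_k)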
inspect the dendrogram resulting from hierarchical cluster- The analysis of silhouette scores for different data sets is
ing, but this remains a somewhat subjective and expensive given below.
approach, since hierarchical clustering is intrinsically slower
than k-means. Hierarchical clustering could still be applied 2.1. Iris data set
on several small subsets of the data, to find a reasonable
estimate of k . We choose the more direct method of analyz- This is a classic multi-class classification data set pro-
ing the silhouette scores [5] which measure the quality of vided by scikit-learn. The data set consists of 3 classes, 4
clusters. A high average silhouette coefficient value indicates dimensions or features, and 150 samples. Figure 1 shows
good clustering and helps in deciding the optimal value of
the number of clusters k [3]. We present examples of this
approach, along with 2-d and 3-d scatter plots to support if
not validate the results.
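The per-sample coefficients behind this mean are also available directly; the following sketch assumes X and labels come from a k-means fit such as the one shown earlier.

# Hedged sketch: per-sample silhouette coefficients (b - a) / max(a, b),
# assuming X and labels come from a prior k-means fit.
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

sample_scores = silhouette_samples(X, labels)  # one coefficient per point, in [-1, +1]
mean_score = silhouette_score(X, labels)       # equals sample_scores.mean()
suspect = np.where(sample_scores < 0)[0]       # points that may be in the wrong cluster
print(mean_score, len(suspect))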
The analysis of silhouette scores for different data sets is given below.

2.1. Iris data set

This is a classic multi-class classification data set provided by scikit-learn. The data set consists of 3 classes, 4 dimensions or features, and 150 samples.

Figure 1. Silhouette scores for Iris data set

Figure 1 shows the silhouette scores for different numbers of clusters, with k ranging from 2 to 10. It can be observed that the silhouette score is the highest for k = 2.

In addition, selecting k = 4 or k = 5 results in silhouette scores that are more or less equally bad. Therefore, k = 2 or k = 3 are the only two reasonable choices for this data set.

Figure 2. 2-d scatter plot for Iris data set

Inspection of Figure 2 shows that the Iris data set can be clustered into either 2 or 3 distinct clusters. In fact, we suppose that most people would say that k = 2 would be obvious. However, the silhouette score suggests that k = 3 is also a reasonable choice.
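The paper's scatter plots are produced after dimensionality reduction; the sketch below shows one plausible way to generate such a plot with PCA and matplotlib. The choice of PCA and all plotting details are assumptions, not taken from the paper.

# One plausible way to produce a 2-d scatter plot like Figure 2.
# Assumption: PCA for dimensionality reduction; the paper does not name the method used.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)
X2 = PCA(n_components=2).fit_transform(X)      # 4 features reduced to 2 components
plt.scatter(X2[:, 0], X2[:, 1], c=labels, s=10)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("Iris clusters after PCA (k = 3)")
plt.show()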
2.2. Clustering Basic Benchmark S-1 set

The S-1 data set [2] is widely used for benchmarking of clustering algorithms. The 2-D data is synthesized, consisting of N = 5000 data points and k = 15 Gaussian clusters with different degrees of cluster overlap. In Figure 3 we see that the silhouette score for k = 15 is the highest, which is what we expect for this data set. Visual inspection of the 2-d scatter plot in Figure 4 supports the claim.

Figure 3. Silhouette score for S-1 data set

Figure 4. 2-d scatter plot for S-1 data set
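The S-1 data set itself is distributed with the clustering basic benchmark [2]. As a stand-in for readers without the benchmark files, the sketch below synthesizes a roughly comparable data set (5000 points, 15 Gaussian clusters in two dimensions) with make_blobs and scores a few candidate values of k; the blob parameters are illustrative and not the benchmark's.

# Illustrative stand-in for the S-1 experiment; this is synthesized data,
# NOT the actual S-1 benchmark set [2].
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=5000, centers=15, n_features=2,
                  cluster_std=0.6, random_state=0)
for k in (10, 15, 20):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    print(k, silhouette_score(X, labels))      # k = 15 should score highest on this data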

Results based on more data sets are presented in a longer companion paper [6]. In those results, some values of k resulted in silhouette scores that were very close, suggesting that in some data sets there might be two (or more?) excellent choices for k. This is a question for future work.

3. Conclusion

The silhouette score was obtained for different values of k for several data sets. The data was subjected to dimensionality reduction and plotted. We observed that the silhouette score provides a way to find a good value of k to specify in k-means clustering algorithms.

4. Future Work

We want to continue this work with larger data sets, to explore the question of similar silhouette scores for different values of k. According to Volkovich et al. [7], sampling can help find the best value of k, as well as good candidates for initial seed values for the clusters. If the data set is quite large, sampling to suggest k as well as the initial seed values, following such approaches, could prove beneficial. For example, we could take a small random sample, say 1:100000, and use our results to suggest a value of k and the initial seed values. Done repeatedly, we could gain statistical confidence, so to speak, in the resulting value of k. If we keep track of the centroids produced after each of those sampled runs, that might lead to a better choice of initial centroids when running k-means on the entire data set.
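A rough sketch of that sampling idea follows: repeatedly draw a small subsample, pick the k with the best silhouette score on it, and collect the winning centroids as candidate seeds for the full run. The subsample size, the k range, and the vote counting are all assumptions for illustration, not the method of [7].

# Rough sketch of the sampling idea (parameters are illustrative assumptions):
# estimate k from repeated small subsamples and keep centroids as candidate seeds.
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def estimate_k_by_sampling(X, k_values=range(2, 21), n_rounds=10, sample_size=1000, seed=0):
    rng = np.random.default_rng(seed)
    votes, seed_pool = Counter(), []
    for _ in range(n_rounds):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        sample = X[idx]
        best = (None, -1.0, None)                      # (k, score, centroids)
        for k in k_values:
            km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(sample)
            score = silhouette_score(sample, km.labels_)
            if score > best[1]:
                best = (k, score, km.cluster_centers_)
        votes[best[0]] += 1
        seed_pool.append(best[2])                      # candidate initial centroids
    return votes.most_common(1)[0][0], seed_pool

The centroids collected from the winning subsample runs could then be passed, for example, as the init argument to KMeans when clustering the entire data set.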
References

[1] Edward W. Forgy. Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics, 21:768–769, 1965.
[2] Pasi Fränti and Sami Sieranoja. k-means properties on six clustering benchmark datasets. Applied Intelligence, 48(12):4743–4759, 2018.
[3] Alboukadel Kassambara. Determining the optimal number of clusters: 3 must know methods, 2017. Available online: https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/ (accessed on 31 April 2018).
[4] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[5] Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.
[6] Ketan Rajshekhar Shahapure. Cluster quality analysis using silhouette score, 2020. M.S. Writing Project, Department of Computer Science and Electrical Engineering, UMBC.
[7] Vladimir Volkovich, Jacob Kogan, and Charles Nicholas. Building initial partitions through sampling techniques. European Journal of Operational Research, 183(3):1097–1105, 2007.
