Shahapure 2020
Abstract: Clustering is an important phase in data mining. Selecting the number of clusters in a clustering algorithm, e.g. choosing the best value of k in the various k-means algorithms [1], can be difficult. We studied the use of silhouette scores and scatter plots to suggest, and then validate, the number of clusters we specified in running the k-means clustering algorithm on two publicly available data sets. Scikit-learn’s [4] silhouette score method, which is a measure of the quality of a cluster, was used to find the mean silhouette coefficient of all the samples for different numbers of clusters. The highest silhouette score indicates the optimal number of clusters. We present several instances of utilizing the silhouette score to determine the best value of k for those data sets.

1. Introduction

Determining the optimal number of clusters for a data set is an important problem in certain clustering algorithms, especially the well-known k-means and similar algorithms [1]. There is no one-size-fits-all method to determine the value of k; the optimal value for a given data set may well depend on the methods used for measuring similarities and the initial seed values used for partitioning. A solution is to inspect the dendrogram resulting from hierarchical clustering, but this remains a somewhat subjective and expensive approach, since hierarchical clustering is intrinsically slower than k-means. Hierarchical clustering could still be applied on several small subsets of the data, to find a reasonable estimate of k. We choose the more direct method of analyzing the silhouette scores [5], which measure the quality of clusters. A high average silhouette coefficient value indicates good clustering and helps in deciding the optimal value of the number of clusters k [3]. We present examples of this approach, along with 2-d and 3-d scatter plots to support, if not validate, the results.

We propose to investigate whether the silhouette score can be used for validation of the number of clusters obtained by running the k-means clustering algorithm on each of several data sets. Dimensionality reduction is done to reduce the number of features and generate a 2D or 3D scatter plot, which helps in visually analyzing the number of clusters and validating the result. The following steps are carried out for the analysis:

1) Obtain the data in the form of tuples.
2) Run sklearn’s k-means algorithm on the data set.
3) Obtain cluster labels by fitting the data.
4) Calculate the mean silhouette coefficient by passing the tuples and cluster labels.
5) Repeat this for different numbers of clusters (i.e. values of k).

2. Quality measurement

Scikit-learn’s silhouette score function computes the mean silhouette coefficient of all samples. The silhouette coefficient is calculated by taking into account the mean intra-cluster distance a and the mean nearest-cluster distance b for each data point. The silhouette coefficient for a sample is (b − a)/max(a, b).

• A silhouette score with a value near +1 means the data point is in the correct cluster.
• A silhouette score with a value near 0 means the data point might belong in some other cluster.
• A silhouette score with a value near -1 means the data point is in the wrong cluster.

The analysis of silhouette scores for different data sets is given below.

2.1. Iris data set

This is a classic multi-class classification data set provided by scikit-learn. The data set consists of 3 classes, 4 dimensions or features, and 150 samples. Figure 1 shows the silhouette scores for different numbers of clusters, with k ranging from 2 to 10. It can be observed that the silhouette score is the highest for k = 2. In addition, selecting k = 4 or k = 5 results in silhouette scores that are more or less equally bad. Therefore, k = 2 or k = 3 are the only two reasonable choices for this data set.

Figure 1. Silhouette scores for Iris data set

Authorized licensed use limited to: University of Prince Edward Island. Downloaded on May 16,2021 at 11:46:49 UTC from IEEE Xplore. Restrictions apply.
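As a worked illustration of the formula (b − a)/max(a, b) from Section 2, the following minimal sketch computes the silhouette coefficient by hand for a tiny made-up 1-D data set and averages it over all points. The data, the two labelings, and the helper names `silhouette_coefficient` and `mean_silhouette` are hypothetical and chosen only so the arithmetic can be checked by hand. This is not scikit-learn's implementation, and it assumes every cluster contains at least two points.

```python
from statistics import mean


def silhouette_coefficient(i, points, labels):
    """Silhouette coefficient (b - a) / max(a, b) for the point at index i.

    Assumes 1-D points, absolute difference as the distance, at least two
    clusters, and at least two points in the cluster of point i.
    """
    own = labels[i]
    # a: mean distance from point i to the other points in its own cluster
    a = mean(abs(points[i] - p)
             for j, (p, lab) in enumerate(zip(points, labels))
             if lab == own and j != i)
    # b: smallest mean distance from point i to the points of another cluster
    b = min(mean(abs(points[i] - p)
                 for p, lab in zip(points, labels) if lab == other)
            for other in set(labels) if other != own)
    return (b - a) / max(a, b)


def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points."""
    return mean(silhouette_coefficient(i, points, labels)
                for i in range(len(points)))


# Hypothetical toy data: two tight, well-separated groups on a line.
points = [0.0, 1.0, 10.0, 11.0]
good_labels = [0, 0, 1, 1]  # matches the visible grouping
bad_labels = [0, 1, 0, 1]   # mixes the two groups

# Point 0 under good_labels: a = 1, b = 10.5, so (10.5 - 1) / 10.5
print(silhouette_coefficient(0, points, good_labels))  # about 0.905
print(mean_silhouette(points, good_labels))            # about 0.90
print(mean_silhouette(points, bad_labels))             # about -0.45
```

The well-separated labeling scores close to +1 and the mixed labeling scores below 0, which matches the interpretation of the three bullet points in Section 2.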
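The five analysis steps listed in the Introduction can be sketched for the Iris data set roughly as follows. This is a minimal sketch, not the authors' script: `n_init=10` and `random_state=0` are assumptions made here for reproducibility, and the names `scores` and `best_k` are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

# Step 1: obtain the data as an array of tuples (150 samples, 4 features).
X = load_iris().data

scores = {}
for k in range(2, 11):  # Step 5: repeat for k = 2, ..., 10
    # Step 2: run k-means; n_init and random_state are assumptions made
    # here for reproducibility, not settings reported in the text.
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    # Step 3: obtain cluster labels by fitting the data.
    labels = model.fit_predict(X)
    # Step 4: mean silhouette coefficient from the tuples and the labels.
    scores[k] = silhouette_score(X, labels)

for k, score in scores.items():
    print(f"k = {k:2d}  mean silhouette = {score:.3f}")

# The k with the highest mean silhouette coefficient is the suggested choice.
best_k = max(scores, key=scores.get)
print("highest score at k =", best_k)
```

Exact scores can vary slightly with the scikit-learn version and the seed, so the printed values should be read as indicative rather than as a reproduction of Figure 1.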
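The dimensionality reduction mentioned in the Introduction could be sketched as below. The text does not name the reduction technique here, so the use of PCA is an assumption, as are the choice of k = 3 and the `n_init` and `random_state` settings. The sketch only produces the 2-D coordinates and a per-cluster summary; drawing the scatter plot itself is left out.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Cluster in the original 4-D feature space (k = 3 is an assumed example).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Assumption: PCA as the dimensionality reduction, keeping 2 components
# so that each sample gets an (x, y) position for a 2-D scatter plot.
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)

# Summarize where each cluster falls in the projected plane.
for cluster in sorted(set(labels)):
    members = X_2d[labels == cluster]
    print(cluster, len(members), members.mean(axis=0).round(2))
```

With matplotlib available, passing `X_2d[:, 0]`, `X_2d[:, 1]`, and `c=labels` to `plt.scatter` would give a colored 2-D scatter plot for visual inspection of the clusters.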
Figure 3. Silhouette score for S-1 data set

that the silhouette score for k = 15 is the highest, which is what we expect for this data set. Visual inspection of the 2-d scatter plot in Figure 4 supports the claim.

Results based on more data sets are presented in a longer companion paper [6]. In those results, some values of k resulted in silhouette scores that were very close, suggesting that in some data sets there might be two (or more?) excellent choices for k. This is a question for future work.

3. Conclusion

The silhouette score was obtained for different values of k for several data sets. The data was subjected to dimension-

References

[1] Edward W. Forgy. Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics, 21:768–769, 1965.
[2] Pasi Fränti and Sami Sieranoja. k-means properties on six clustering benchmark datasets. Applied Intelligence, 48(12):4743–4759, 2018.
[3] Alboukadel Kassambara. Determining the optimal number of clusters: 3 must know methods. Available online: https://www.datanovia.com/en/lessons/determiningthe-optimal-number-of-clusters-3-must-know-methods/ (accessed on 31 April 2018), 2017.
[4] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[5] Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.
[6] Ketan Rajshekhar Shahapure. Cluster quality analysis using silhouette score, 2020. M.S. Writing Project, Department of Computer Science and Electrical Engineering, UMBC.
[7] Vladimir Volkovich, Jacob Kogan, and Charles Nicholas. Building initial partitions through sampling techniques. European Journal of Operational Research, 183(3):1097–1105, 2007.