Stop using the elbow criterion for k-means and how to choose the number of clusters instead

Author:

Erich Schubert

ACM SIGKDD Explorations Newsletter, Volume 25, Issue 1

Pages 36 - 42

https://doi.org/10.1145/3606274.3606278

Published: 05 July 2023

Abstract

A major challenge when using k-means clustering is often how to choose the parameter k, the number of clusters. In this letter, we want to point out that it is very easy to draw poor conclusions from a common heuristic, the "elbow method". Better alternatives have been known in the literature for a long time, and we want to draw attention to some of these easy-to-use options that often perform better. This letter is a call to stop using the elbow method altogether, because it severely lacks theoretical support. We want to encourage educators to discuss the problems of the method (if they introduce it in class at all) and to teach alternatives instead, while researchers and reviewers should reject conclusions drawn from the elbow method.
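
To make the concern concrete, the following is a minimal sketch and is not taken from the letter itself. It runs a plain Lloyd-style k-means on one-dimensional data drawn uniformly at random, which has no cluster structure at all, and prints the within-cluster sum of squared errors (SSE) for each k. The helper names `kmeans_sse` and `sse_curve`, the data set, the seeds, and all parameter values are illustrative assumptions.

```python
import random


def kmeans_sse(points, k, n_init=5, max_iter=100, seed=0):
    """Run Lloyd's algorithm on 1-D data; return the best SSE over n_init restarts."""
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(n_init):
        # Initialize with k randomly chosen data points (illustrative choice).
        centers = rng.sample(points, k)
        for _ in range(max_iter):
            # Assignment step: group each point with its nearest center.
            clusters = [[] for _ in range(k)]
            for x in points:
                j = min(range(k), key=lambda c: (x - centers[c]) ** 2)
                clusters[j].append(x)
            # Update step: move each center to the mean of its cluster;
            # an empty cluster keeps its previous center.
            new_centers = [
                sum(c) / len(c) if c else centers[i]
                for i, c in enumerate(clusters)
            ]
            if new_centers == centers:
                break
            centers = new_centers
        sse = sum(min((x - c) ** 2 for c in centers) for x in points)
        best = min(best, sse)
    return best


def sse_curve(points, k_max):
    """SSE for k = 1 .. k_max, the curve inspected by the elbow heuristic."""
    return [kmeans_sse(points, k) for k in range(1, k_max + 1)]


# Uniform data on [0, 1): by construction there are no clusters to find.
data_rng = random.Random(42)
uniform_points = [data_rng.random() for _ in range(300)]

curve = sse_curve(uniform_points, 6)
for k, sse in enumerate(curve, start=1):
    print(f"k={k}  SSE={sse:.3f}")
```

For uniform one-dimensional data, splitting the range into k equal intervals gives an SSE that shrinks roughly in proportion to 1/k², so the curve drops steeply for small k and then flattens. A plot of these values can therefore look like it has an "elbow" even though the data contain no clusters, which is one way to see why reading k off such a curve can mislead.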

Published In

ACM SIGKDD Explorations Newsletter Volume 25, Issue 1

June 2023

82 pages

ISSN: 1931-0145

EISSN: 1931-0153

DOI: 10.1145/3606274

Copyright © 2023 Copyright is held by the owner/author(s).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States
