Stop using the elbow criterion for k-means and how to choose the number of clusters instead

Published: 05 July 2023

Abstract

A major challenge when using k-means clustering is often how to choose the parameter k, the number of clusters. In this letter, we want to point out that it is very easy to draw poor conclusions from a common heuristic, the "elbow method". Better alternatives have been known in the literature for a long time, and we want to draw attention to some of these easy-to-use options, which often perform better. This letter is a call to stop using the elbow method altogether, because it severely lacks theoretical support. We want to encourage educators to discuss the problems of the method (if introducing it in class at all) and to teach alternatives instead, while researchers and reviewers should reject conclusions drawn from the elbow method.


Published In

ACM SIGKDD Explorations Newsletter, Volume 25, Issue 1
June 2023
82 pages
ISSN:1931-0145
EISSN:1931-0153
DOI:10.1145/3606274
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Article
