
Accelerating k-Means Clustering with Cover Trees

  • Conference paper
Similarity Search and Applications (SISAP 2023)

Abstract

The k-means clustering algorithm is a popular method that partitions data into k clusters. Many improvements exist to accelerate the standard algorithm. Most current research employs upper and lower bounds on point-to-cluster distances together with the triangle inequality to reduce the number of distance computations, using only arrays as underlying data structures. These approaches cannot exploit the fact that nearby points are likely assigned to the same cluster. We propose a new k-means algorithm based on the cover tree index that has relatively low overhead and performs well over a wider parameter range than previous approaches based on the k-d tree. By combining this with upper and lower bounds, as in state-of-the-art approaches, we obtain a hybrid algorithm that combines the benefits of tree aggregation and bounds-based filtering.
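The bounds-based filtering the abstract refers to can be illustrated with a minimal sketch. This is not the paper's cover-tree algorithm, and all names are illustrative: it is plain Lloyd iteration where, by the triangle inequality, a point need not compute its distance to another center c' whenever its distance to its current center c is at most d(c, c')/2.

```python
# Minimal sketch (not the paper's method): Lloyd's k-means with
# triangle-inequality pruning in the style of Elkan. If
# d(x, c) <= d(c, c') / 2, then c' cannot be closer to x than c,
# so the distance d(x, c') is never computed.
import math

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def kmeans_pruned(points, centers, iters=10):
    skipped = 0                      # pruned distance computations
    assign = [0] * len(points)
    for _ in range(iters):
        # Pairwise center-center distances used for pruning.
        cc = [[dist(c1, c2) for c2 in centers] for c1 in centers]
        for i, x in enumerate(points):
            best = assign[i]
            d_best = dist(x, centers[best])
            for j in range(len(centers)):
                if j == best:
                    continue
                # Triangle inequality: d(x, c_j) >= cc[best][j] - d_best.
                if d_best <= cc[best][j] / 2:
                    skipped += 1
                    continue
                d_j = dist(x, centers[j])
                if d_j < d_best:
                    best, d_best = j, d_j
            assign[i] = best
        # Recompute each center as the mean of its assigned points.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = tuple(
                    sum(c) / len(members) for c in zip(*members))
    return assign, skipped

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
assign, skipped = kmeans_pruned(points, [(0.0, 0.1), (5.0, 5.0)])
print(assign, skipped)
```

On this toy data the two well-separated clusters are found, and after the first iteration every point is pruned without any extra distance computation. The cover-tree variant proposed in the paper additionally aggregates nearby points so that whole subtrees can be assigned at once.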

Part of the work on this paper has been supported by Deutsche Forschungsgemeinschaft (DFG), project number 124020371, within the Collaborative Research Center SFB 876 “Providing Information by Resource-Constrained Analysis”, project A2.



Notes

  1. It is also possible to stop early when the movement of the centers falls below some threshold.
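The stopping criterion from the footnote can be sketched in a few lines; the tolerance name `tol` is ours, not from the paper:

```python
# Minimal sketch of early stopping: terminate the iteration when no
# center moved farther than a tolerance `tol` (illustrative name).
import math

def max_center_shift(old_centers, new_centers):
    # math.dist requires Python 3.8+.
    return max(math.dist(o, n) for o, n in zip(old_centers, new_centers))

old = [(0.0, 0.0), (5.0, 5.0)]
new = [(0.001, 0.0), (5.0, 5.0005)]
converged = max_center_shift(old, new) < 1e-2
print(converged)
```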


Author information

Correspondence to Andreas Lang.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Lang, A., Schubert, E. (2023). Accelerating k-Means Clustering with Cover Trees. In: Pedreira, O., Estivill-Castro, V. (eds) Similarity Search and Applications. SISAP 2023. Lecture Notes in Computer Science, vol 14289. Springer, Cham. https://doi.org/10.1007/978-3-031-46994-7_13

  • DOI: https://doi.org/10.1007/978-3-031-46994-7_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-46993-0

  • Online ISBN: 978-3-031-46994-7

  • eBook Packages: Computer Science, Computer Science (R0)
