Open Access

Speedup of the k-Means Algorithm for Partitioning Large Datasets of Flat Points by a Preliminary Partition and Selecting Initial Centroids

Vadim Romanuke

Published Online: Aug 17, 2023

Applied Computer Systems

Volume 28 (2023): Issue 1 (June 2023)

Page range: 1 - 12

DOI: https://doi.org/10.2478/acss-2023-0001

Keywords
Flat objects, initial centroids, k-means, large dataset, speedup

This work is licensed under the Creative Commons Attribution 4.0 International License.

The problem of partitioning large datasets of flat points is considered. Known as the centroid-based clustering problem, it is mainly addressed by the k-means algorithm and its modifications. As k-means performs worse on large datasets, including datasets with a stretched shape, the goal is to study whether centroid-based clustering can be improved in such cases. On non-sparse datasets, it is quite noticeable that the clusters produced by k-means resemble a honeycomb. This is natural for rectangular-shaped datasets because hexagonal cells make efficient use of space, owing to which the sum of the within-cluster squared Euclidean distances to the centroids approaches its minimum. Therefore, lattices of rectangular and hexagonal clusters, consisting of stretched rectangles and regular hexagons, are suggested to be applied successively. The initial centroids are then calculated by averaging within the respective hexagons and are used as seeds to start the k-means algorithm. This ensures faster and more accurate convergence: the expected speedup is at least 1.7 to 2.1 times, with an accuracy gain of 0.7 % to 0.9 %. The lattice of rectangular clusters, applied first, produces a rather rough but effective partition that optionally allows further clustering to be run on parallel processor cores. The lattice of hexagonal clusters, applied to every rectangle, allows initial centroids to be obtained very quickly. Such centroids are far closer to the solution than the initial centroids in the k-means++ algorithm. Another approach to the k-means update, in which initial centroids are selected separately within the hexagons of every rectangle, can be used as well. It is faster than selecting initial centroids across all hexagons but is less accurate: the speedup is 9 to 11 times, with a possible accuracy loss of 0.3 %. However, this approach may still outperform the k-means algorithm. The speedup increases as both lattices become denser and the dataset becomes larger, reaching 30 to 50 times.
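The seeding idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names (`hexagon_seeds`, `lattice_seeds`, `lloyd`, `sse_to_nearest`), the equal-width vertical strips used as the rectangular lattice, the pointy-top hexagon orientation, and the choice to let the number of clusters equal the number of non-empty hexagons are all assumptions made for illustration. Only NumPy is used.

```python
import numpy as np

def hexagon_seeds(points, size, origin=(0.0, 0.0)):
    """Mean of the points inside each non-empty regular hexagon (circumradius `size`)."""
    x = (points[:, 0] - origin[0]) / size
    y = (points[:, 1] - origin[1]) / size
    # Fractional axial coordinates on a pointy-top hexagonal grid (assumed orientation).
    q = np.sqrt(3.0) / 3.0 * x - y / 3.0
    r = 2.0 / 3.0 * y
    s = -q - r
    # Cube rounding: round all three coordinates, then recompute the one
    # with the largest rounding error so that q + r + s == 0 still holds.
    rq, rr, rs = np.rint(q), np.rint(r), np.rint(s)
    dq, dr, ds = np.abs(rq - q), np.abs(rr - r), np.abs(rs - s)
    fix_q = (dq > dr) & (dq > ds)
    fix_r = ~fix_q & (dr > ds)
    rq = np.where(fix_q, -rr - rs, rq)
    rr = np.where(fix_r, -rq - rs, rr)
    cells = np.stack([rq, rr], axis=1).astype(np.int64)
    _, inverse = np.unique(cells, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    counts = np.bincount(inverse)
    # Per-hexagon averages of x and y give one seed per non-empty hexagon.
    return np.stack(
        [np.bincount(inverse, weights=points[:, d]) / counts for d in range(2)],
        axis=1,
    )

def lattice_seeds(points, n_rect, hex_size):
    """Rough rectangular partition first, then a hexagonal lattice inside every rectangle."""
    edges = np.linspace(points[:, 0].min(), points[:, 0].max(), n_rect + 1)
    strip = np.clip(np.searchsorted(edges, points[:, 0], side="right") - 1, 0, n_rect - 1)
    y_min = points[:, 1].min()
    seeds = [
        hexagon_seeds(points[strip == i], hex_size, origin=(edges[i], y_min))
        for i in range(n_rect)
        if np.any(strip == i)
    ]
    return np.vstack(seeds)

def sse_to_nearest(points, centroids):
    """Sum of squared Euclidean distances from every point to its nearest centroid."""
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def lloyd(points, centroids, max_iter=100, tol=1e-9):
    """Plain k-means (Lloyd) iterations started from the given centroids."""
    centroids = centroids.copy()
    for n_iter in range(1, max_iter + 1):
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = centroids.copy()
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members):  # an empty cluster keeps its previous centroid
                new[j] = members.mean(axis=0)
        shift = np.abs(new - centroids).max()
        centroids = new
        if shift <= tol:
            break
    return centroids, n_iter

# Synthetic stretched rectangular dataset (illustrative data, not from the article).
rng = np.random.default_rng(0)
points = rng.uniform([0.0, 0.0], [10.0, 2.0], size=(5000, 2))
seeds = lattice_seeds(points, n_rect=4, hex_size=0.5)
centroids, n_iter = lloyd(points, seeds)
print(len(seeds), n_iter, sse_to_nearest(points, seeds), sse_to_nearest(points, centroids))
```

In this sketch the lattice density (`n_rect`, `hex_size`) controls the number of seeds, and each strip could be seeded independently, which is what would make a per-rectangle parallel run possible. Comparing `n_iter` against a run started from randomly chosen points is one simple way to observe the effect of the seeding on a given dataset.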

eISSN: 2255-8691
Language: English
Publication timeframe: 2 times per year
Journal Subjects: Computer Sciences, Artificial Intelligence, Information Technology, Project Management, Software Development

© 2023 Vadim Romanuke, published by Sciendo
