Local Search Yields a PTAS for $k$-Means in Doubling Metrics

Published: 01 January 2019

Abstract

The most well-known and ubiquitous clustering problem encountered in nearly every branch of science is undoubtedly $k$-means: given a set of data points and a parameter $k$, select $k$ centers and partition the data points into $k$ clusters around these centers so that the sum of squares of distances of the points to their cluster center is minimized. Typically, these data points lie in Euclidean space $\mathbb{R}^d$ for some $d\geq 2$. $k$-means and the first algorithms for it were introduced in the 1950s. Over the last six decades, hundreds of papers have studied this problem and different algorithms have been proposed for it. The most commonly used algorithm in practice is known as Lloyd--Forgy, which is also referred to as “the” $k$-means algorithm, and various extensions of it often work very well in practice. However, they may produce solutions whose cost is arbitrarily large compared to that of the optimum solution.

Kanungo et al. [Comput. Geom., 28 (2004), pp. 89--112] analyzed a very simple local search heuristic to get a polynomial-time algorithm with approximation ratio $9+\epsilon$ for any fixed $\epsilon>0$ for $k$-means in Euclidean space. Finding an algorithm with a better worst-case approximation guarantee has remained one of the biggest open questions in this area, in particular, whether one can get a true polynomial-time approximation scheme (PTAS) for fixed-dimension Euclidean space. We settle this problem by showing that a simple local search algorithm provides a PTAS for $k$-means in $\mathbb{R}^d$ for any fixed $d$. More precisely, for any error parameter $\epsilon>0$, the local search algorithm that considers swaps of up to $\rho=d^{O(d)}\cdot{\epsilon}^{-O(d/\epsilon)}$ centers at a time will produce a solution using exactly $k$ centers whose cost is at most $(1+\epsilon)$ times the cost of the optimum solution.

Although the algorithm is not practical because of its large polynomial running time, it settles the approximability of this important problem. Our analysis extends very easily to the more general settings where we want to minimize the sum of $q$th powers of the distances between data points and their cluster centers (instead of the sum of squares of distances as in $k$-means) for any fixed $q\geq 1$ and where the metric may not be Euclidean but still has fixed doubling dimension. Finally, our techniques also extend to other classic clustering problems. We provide the first demonstration that local search yields a PTAS for uncapacitated facility location and the generalization of $k$-median to the setting with nonuniform opening costs in doubling metrics.

Published In

SIAM Journal on Computing, Volume 48, Issue 2
DOI:10.1137/smjcat.48.2

Publisher

Society for Industrial and Applied Mathematics

United States

Author Tags

  1. $k$-means
  2. polynomial time approximation scheme
  3. doubling metrics

Author Tag

  1. 68W25
