Abstract
The classical center-based clustering problems such as k-means/median/center assume that the optimal clusters satisfy the locality property, i.e., that the points in the same cluster are close to each other. A number of clustering problems arising in machine learning do not have this property. For instance, consider the r-gather clustering problem, where there is an additional constraint that each cluster should contain at least r points, or the capacitated clustering problem, where there is an upper bound on the cluster sizes. Consider a variant of the k-means problem that may be regarded as a general version of such problems. Here, the optimal clusters O_1, ..., O_k are an arbitrary partition of the dataset and the goal is to output k centers c_1, ..., c_k such that the objective function \({\sum }_{i = 1}^{k} {\sum }_{x \in O_{i}} ||x - c_{i}||^{2}\) is minimized. It is not difficult to argue that no algorithm (without knowing the optimal clusters) that outputs a single set of k centers can do well with respect to this objective. However, this does not rule out the existence of algorithms that output a list of such sets of k centers such that at least one set in the list does well. Given an error parameter ε > 0, let ℓ denote the size of the smallest list of k-center sets such that at least one of them gives a (1 + ε)-approximation with respect to the objective function above. In this paper, we show an upper bound on ℓ by giving a randomized algorithm that outputs a list of \(2^{\tilde {O}(k/\varepsilon )}\) k-center sets. We also give a closely matching lower bound of \(2^{\tilde {\Omega }(k/\sqrt {\varepsilon })}\). Moreover, our algorithm runs in time \(O \left (n d \cdot 2^{\tilde {O}(k/\varepsilon )} \right )\).
This is a significant improvement over the previous result of Ding and Xu (2015), who gave an algorithm with running time \(O(n d \cdot (\log n)^{k} \cdot 2^{poly(k/\varepsilon)})\) that outputs a list of size \(O((\log n)^{k} \cdot 2^{poly(k/\varepsilon)})\). Our techniques generalize to the k-median problem and to many other settings involving non-Euclidean distance measures.
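As an illustration of the objective (not code from the paper; the helper names `constrained_cost` and `best_from_list` are hypothetical), note that the partition is fixed and arbitrary, so the cost of a set of centers is evaluated against the given clusters rather than a Voronoi assignment, and for a fixed partition the centroid of each part is its optimal center:

```python
import numpy as np

def constrained_cost(X, partition, centers):
    """Objective from the abstract: sum over clusters O_i of
    sum_{x in O_i} ||x - c_i||^2. The partition is arbitrary --
    it need not be the Voronoi partition of the centers."""
    return sum(np.sum((X[idx] - c) ** 2)
               for idx, c in zip(partition, centers))

def best_from_list(X, partition, candidates):
    """Given a list of candidate k-center sets (as a list-k-means
    algorithm would output), keep the one minimizing the cost.
    Here clusters are matched to centers by index."""
    return min(candidates, key=lambda C: constrained_cost(X, partition, C))

# Toy data: a partition forced by a side constraint, not by locality.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
partition = [np.array([0, 2]), np.array([1, 3])]  # deliberately non-local
centroids = [X[idx].mean(axis=0) for idx in partition]
# For a fixed partition, the centroid of each part is the optimal center.
assert constrained_cost(X, partition, centroids) <= \
    constrained_cost(X, partition, [X[0], X[2]])
```

The point of the toy partition is that no single set of k centers induces these clusters via nearest-center assignment, which is why a list of candidate center sets is needed.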
Notes
Ding and Xu [5] also discuss such partition algorithms for a number of clustering problems with side constraints.
For any real numbers \(a_{1}, ..., a_{m}\), \(({\sum }_{r} a_{r})^{2}/m \leq {\sum }_{r} a_{r}^{2}\).
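A quick numerical sanity check of this inequality (a special case of the Cauchy-Schwarz / power-mean inequality), added here purely as an illustration:

```python
import random

# Check (sum_r a_r)^2 / m  <=  sum_r a_r^2 on random inputs.
random.seed(0)
for _ in range(1000):
    m = random.randint(1, 20)
    a = [random.uniform(-10.0, 10.0) for _ in range(m)]
    # Small tolerance guards against floating-point rounding.
    assert sum(a) ** 2 / m <= sum(x * x for x in a) + 1e-9
```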
Please see [9] for a discussion of such distance measures. This work shows how to extend such D^2-sampling based analysis to settings involving these distance measures.
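For readers unfamiliar with the technique, the following is a minimal sketch of the D^2-sampling step (the seeding rule of k-means++) in Euclidean space; the function name `d2_sample` is illustrative and not from the paper, and extending to other distance measures amounts to replacing the squared Euclidean distance below:

```python
import numpy as np

def d2_sample(X, k, rng):
    """D^2-sampling: the first center is uniform over the points;
    each subsequent center is a point drawn with probability
    proportional to its squared distance to the nearest center
    chosen so far."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```

Points far from all current centers are proportionally more likely to be picked, which is what makes each sampled center land near a yet-uncovered optimal cluster with good probability.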
References
Ackermann, M.R., Blömer, J., Sohler, C.: Clustering for metric and nonmetric distance measures. ACM Trans. Algorithms 6, 59:1–59:26 (2010)
Bădoiu, M., Har-Peled, S., Indyk, P.: Approximate clustering via core-sets. In: Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing, STOC ’02, pp. 250–257. ACM, New York (2002)
Chen, K.: On k-median clustering in high dimensions. In: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’06, pp. 1177–1185. ACM, New York (2006)
de la Vega, W.F., Karpinski, M., Kenyon, C., Rabani, Y.: Approximation schemes for clustering problems. In: Proceedings of the Thirty-fifth Annual ACM Symposium on Theory of Computing, STOC ’03, pp. 50–58. ACM, New York (2003)
Ding, H., Xu, J.: A unified framework for clustering constrained data without locality property. In: Proceedings of the Twenty-sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’15, pp. 1471–1490 (2015)
Feldman, D., Monemizadeh, M., Sohler, C.: A PTAS for k-means clustering based on weak coresets. In: Proceedings of the Twenty-third Annual Symposium on Computational Geometry, SCG ’07, pp. 11–18. ACM, New York (2007)
Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: Proceedings of the Thirty-sixth Annual ACM Symposium on Theory of Computing, STOC ’04, pp. 291–300. ACM, New York (2004)
Inaba, M., Katoh, N., Imai, H.: Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering (extended abstract). In: Proceedings of the Tenth Annual Symposium on Computational Geometry, SCG ’94, pp. 332–339. ACM, New York (1994)
Jaiswal, R., Kumar, A., Sen, S.: A simple D^2-sampling based PTAS for k-means and other clustering problems. Algorithmica 70(1), 22–46 (2014)
Jaiswal, R., Kumar, M., Yadav, P.: Improved analysis of D^2-sampling based PTAS for k-means and other clustering problems. Inf. Process. Lett. 115(2), 100–103 (2015)
Kumar, A., Sabharwal, Y., Sen, S.: Linear-time approximation schemes for clustering problems in any dimensions. J. ACM 57(2), 5:1–5:32 (2010)
Matoušek, J.: On approximate geometric k-clustering. Discret. Comput. Geom. 24(1), 61–84 (2000)
Acknowledgments
Ragesh Jaiswal acknowledges the support of ISF-UGC India-Israel joint research grant 2014.
Additional information
This article is part of the Topical Collection on Theoretical Aspects of Computer Science
Õ notation hides an \( O(\log \frac{k}{\varepsilon}) \) factor.
Cite this article
Bhattacharya, A., Jaiswal, R. & Kumar, A. Faster Algorithms for the Constrained k-means Problem. Theory Comput Syst 62, 93–115 (2018). https://doi.org/10.1007/s00224-017-9820-7