DOI: 10.5555/3495724.3496580

BanditPAM: almost linear time k-medoids clustering via multi-armed bandits

Published: 06 December 2020

Abstract

Clustering is a ubiquitous task in data science. Compared to the commonly used k-means clustering, k-medoids clustering requires the cluster centers to be actual data points and supports arbitrary distance metrics, which permits greater interpretability and the clustering of structured objects. Current state-of-the-art k-medoids clustering algorithms, such as Partitioning Around Medoids (PAM), are iterative and are quadratic in the dataset size n for each iteration, making them prohibitively expensive for large datasets. We propose BanditPAM, a randomized algorithm inspired by techniques from multi-armed bandits that reduces the complexity of each PAM iteration from O(n²) to O(n log n) and returns the same results with high probability, under assumptions on the data that often hold in practice. As such, BanditPAM matches state-of-the-art clustering loss while reaching solutions much faster. We empirically validate our results on several large real-world datasets, including a coding exercise submissions dataset from Code.org, the 10x Genomics 68k PBMC single-cell RNA sequencing dataset, and the MNIST handwritten digits dataset. In these experiments, we observe that BanditPAM returns the same results as state-of-the-art PAM-like algorithms up to 4x faster while performing up to 200x fewer distance computations. The improvements demonstrated by BanditPAM enable k-medoids clustering on a wide range of applications, including identifying cell types in large-scale single-cell data and providing scalable feedback for students learning computer science online. We also release highly optimized Python and C++ implementations of our algorithm.
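The bandit-based idea described in the abstract can be sketched as follows: treat each candidate medoid as an "arm" whose unknown value is its mean distance to the rest of the dataset, estimate those values from randomly sampled reference points, and eliminate candidates whose confidence intervals show they cannot be best. The sketch below illustrates this for selecting a single medoid, roughly the first step of BanditPAM's BUILD phase; the function name, batch size, known-variance confidence bound, and exact-computation fallback are our illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def banditpam_build_first_medoid(X, batch_size=100, delta=1e-3, seed=0):
    """Illustrative sketch of bandit-based selection of the first medoid.

    Each point is an arm whose unknown value is its mean distance to all
    other points. We estimate these values from shared batches of random
    reference points and successively eliminate arms whose confidence
    intervals rule them out as the minimizer.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    candidates = np.arange(n)   # arms still in contention
    mu = np.zeros(n)            # running mean-distance estimates
    n_samples = np.zeros(n)     # reference samples drawn per arm

    while len(candidates) > 1:
        # Draw a fresh batch of reference points, shared across all arms
        # so that the estimates are correlated and comparisons are tight.
        refs = rng.integers(0, n, size=batch_size)
        for c in candidates:
            d = np.linalg.norm(X[c] - X[refs], axis=1).mean()
            mu[c] = (mu[c] * n_samples[c] + d * batch_size) / (
                n_samples[c] + batch_size)
            n_samples[c] += batch_size
        # Sub-Gaussian confidence radius (scale parameter assumed known).
        sigma = 1.0
        ci = sigma * np.sqrt(2 * np.log(1 / delta) / n_samples[candidates])
        lowest_ucb = (mu[candidates] + ci).min()
        # Keep arms whose lower bound could still beat the best upper bound.
        candidates = candidates[mu[candidates] - ci <= lowest_ucb]
        if n_samples[candidates[0]] >= n:
            # Sampling budget exceeds n per arm: finish exactly.
            exact = [np.linalg.norm(X[c] - X, axis=1).mean()
                     for c in candidates]
            return candidates[int(np.argmin(exact))]
    return candidates[0]
```

Because all surviving arms share the same reference samples, their estimated mean distances are strongly correlated and the procedure typically resolves the best arm after far fewer than n distance evaluations per candidate, which is the source of the O(n log n) per-iteration behavior the paper establishes.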

Supplementary Material

Supplemental material (3495724.3496580_supp.pdf)



Published In

NIPS '20: Proceedings of the 34th International Conference on Neural Information Processing Systems
Curran Associates Inc., Red Hook, NY, United States, December 2020
22,651 pages. ISBN 9781713829546.
