DOI: 10.5555/3495724.3496580

BanditPAM: almost linear time k-medoids clustering via multi-armed bandits

Published: 06 December 2020

Abstract

Clustering is a ubiquitous task in data science. Compared to the commonly used k-means clustering, k-medoids clustering requires the cluster centers to be actual data points and supports arbitrary distance metrics, which permits greater interpretability and the clustering of structured objects. Current state-of-the-art k-medoids clustering algorithms, such as Partitioning Around Medoids (PAM), are iterative and are quadratic in the dataset size n for each iteration, making them prohibitively expensive for large datasets. We propose BanditPAM, a randomized algorithm inspired by techniques from multi-armed bandits that reduces the complexity of each PAM iteration from O(n²) to O(n log n) and returns the same results with high probability, under assumptions on the data that often hold in practice. As such, BanditPAM matches state-of-the-art clustering loss while reaching solutions much faster. We empirically validate our results on several large real-world datasets, including a coding exercise submissions dataset from Code.org, the 10x Genomics 68k PBMC single-cell RNA sequencing dataset, and the MNIST handwritten digits dataset. In these experiments, we observe that BanditPAM returns the same results as state-of-the-art PAM-like algorithms up to 4x faster while performing up to 200x fewer distance computations. The improvements demonstrated by BanditPAM enable k-medoids clustering on a wide range of applications, including identifying cell types in large-scale single-cell data and providing scalable feedback for students learning computer science online. We also release highly optimized Python and C++ implementations of our algorithm.
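The bandit-based idea described in the abstract can be sketched as follows: treat each candidate medoid as an "arm" whose unknown value is its mean distance to the rest of the dataset, estimate those values from randomly sampled reference points, and eliminate candidates whose confidence intervals show they cannot be best. The sketch below illustrates this for selecting a single medoid, roughly the first step of BanditPAM's BUILD phase; the function name, batch size, known-variance confidence bound, and exact-computation fallback are our illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def banditpam_build_first_medoid(X, batch_size=100, delta=1e-3, seed=0):
    """Illustrative sketch of bandit-based selection of the first medoid.

    Each point is an arm whose unknown value is its mean distance to all
    other points. We estimate these values from shared batches of random
    reference points and successively eliminate arms whose confidence
    intervals rule them out as the minimizer.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    candidates = np.arange(n)   # arms still in contention
    mu = np.zeros(n)            # running mean-distance estimates
    n_samples = np.zeros(n)     # reference samples drawn per arm

    while len(candidates) > 1:
        # Draw a fresh batch of reference points, shared across all arms
        # so that the estimates are correlated and comparisons are tight.
        refs = rng.integers(0, n, size=batch_size)
        for c in candidates:
            d = np.linalg.norm(X[c] - X[refs], axis=1).mean()
            mu[c] = (mu[c] * n_samples[c] + d * batch_size) / (
                n_samples[c] + batch_size)
            n_samples[c] += batch_size
        # Sub-Gaussian confidence radius (scale parameter assumed known).
        sigma = 1.0
        ci = sigma * np.sqrt(2 * np.log(1 / delta) / n_samples[candidates])
        lowest_ucb = (mu[candidates] + ci).min()
        # Keep arms whose lower bound could still beat the best upper bound.
        candidates = candidates[mu[candidates] - ci <= lowest_ucb]
        if n_samples[candidates[0]] >= n:
            # Sampling budget exceeds n per arm: finish exactly.
            exact = [np.linalg.norm(X[c] - X, axis=1).mean()
                     for c in candidates]
            return candidates[int(np.argmin(exact))]
    return candidates[0]
```

Because all surviving arms share the same reference samples, their estimated mean distances are strongly correlated and the procedure typically resolves the best arm after far fewer than n distance evaluations per candidate, which is the source of the O(n log n) per-iteration behavior the paper establishes.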

Supplementary Material

Supplemental material (3495724.3496580_supp.pdf)



Published In

NIPS '20: Proceedings of the 34th International Conference on Neural Information Processing Systems
Curran Associates Inc., Red Hook, NY, United States, December 2020
22,651 pages. ISBN 9781713829546.
