DOI: 10.1145/3534678.3542600

Online Clustering: Algorithms, Evaluation, Metrics, Applications and Benchmarking

Published: 14 August 2022

Abstract

Online clustering algorithms play a critical role in data science: they offer advantages in time, memory usage and complexity while maintaining high performance compared to traditional clustering methods. This tutorial serves, first, as a survey of online machine learning and, in particular, of data stream clustering methods. State-of-the-art algorithms and the associated core research threads are presented, organized into categories based on distance, density grids and hidden statistical models. Clustering validity indices are also investigated in depth; they are an important part of the clustering process but are usually neglected or replaced with classification metrics, which leads to misleading interpretations of the final results.
This introduction is then put into context with River, a go-to Python library created by merging creme and scikit-multiflow. River is also the first open-source project to include an online clustering module that facilitates reproducibility and allows direct further improvements. Building on this, we propose methods for clustering configuration, applications and benchmark settings, using real-world problems and datasets.
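
To make the distance-based category concrete, here is a minimal sketch of sequential (MacQueen-style) k-means, a classic one-pass technique chosen for illustration. It is not taken from the tutorial and it is not River's implementation. The class name `OnlineKMeans`, the helper `make_stream`, the `learn_one`/`predict_one` method names and the synthetic two-blob stream are all assumptions. The running squared error is one simple label-free signal, tracked here only to show that cluster quality can be monitored without class labels.

```python
import math
import random


class OnlineKMeans:
    """Hypothetical sequential k-means: one pass, one sample at a time."""

    def __init__(self, n_clusters):
        self.n_clusters = n_clusters
        self.centers = []  # centre vectors, stored as tuples of floats
        self.counts = []   # number of samples absorbed by each centre
        self.sse = 0.0     # running squared distance, measured before each update

    def predict_one(self, x):
        """Return the index of the nearest centre (needs at least one centre)."""
        return min(range(len(self.centers)),
                   key=lambda j: math.dist(x, self.centers[j]))

    def learn_one(self, x):
        """Update the model with a single sample and return its cluster index."""
        # The first n_clusters samples simply seed the centres.
        if len(self.centers) < self.n_clusters:
            self.centers.append(tuple(x))
            self.counts.append(1)
            return len(self.centers) - 1

        j = self.predict_one(x)
        self.sse += math.dist(x, self.centers[j]) ** 2
        self.counts[j] += 1
        # Move the winning centre towards x with step 1 / count, which keeps
        # the centre equal to the mean of the samples assigned to it so far.
        step = 1.0 / self.counts[j]
        self.centers[j] = tuple(c + step * (xi - c)
                                for c, xi in zip(self.centers[j], x))
        return j


def make_stream(n=200, seed=42):
    """Synthetic stream (an assumption): alternate between two distant blobs."""
    rng = random.Random(seed)
    for i in range(n):
        cx, cy = (0.0, 0.0) if i % 2 == 0 else (10.0, 10.0)
        yield (cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))


model = OnlineKMeans(n_clusters=2)
for x in make_stream():
    model.learn_one(x)  # constant memory: only the centres and counts are kept

print(model.centers, model.sse)
```

Each sample is seen once and then discarded, so memory depends on the number of clusters rather than on the length of the stream. Seeding the centres from the first samples is a simplification that makes the result sensitive to arrival order.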

Published In

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2022
5033 pages
ISBN:9781450393850
DOI:10.1145/3534678
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. benchmarking
  2. data streams
  3. decision support
  4. online clustering
  5. stream clustering
  6. stream learning

Qualifiers

  • Abstract

Conference

KDD '22
Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
