Online Clustering: Algorithms, Evaluation, Metrics, Applications and Benchmarking

Authors: Minh-Huong Le-Nguyen, Albert Bifet

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Pages 4808 - 4809

https://doi.org/10.1145/3534678.3542600

Published: 14 August 2022

Abstract

Online clustering algorithms play a critical role in data science: they offer advantages in running time, memory usage and complexity while maintaining high performance relative to traditional clustering methods. This tutorial serves first as a survey of online machine learning and, in particular, of data stream clustering methods. It presents state-of-the-art algorithms and their associated core research threads, grouped into categories based on distance, density grids and hidden statistical models. It also examines in depth clustering validity indices, an important part of the clustering process that is usually neglected or replaced with classification metrics, which leads to misleading interpretations of the final results.

This introduction is then put into context with River, a go-to Python library created by merging Creme and scikit-multiflow. River is also the first open-source project to include an online clustering module that facilitates reproducibility and allows direct further improvements. Building on this, we propose methods for clustering configuration, applications and benchmark settings, using real-world problems and datasets.
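
As an illustration of the distance-based family and of a label-free validity index mentioned in the abstract, the following is a minimal, library-independent sketch. It is not taken from the tutorial and does not reproduce River's API: `OnlineKMeans`, `two_blob_stream` and the `ssq` attribute are hypothetical names. It assumes sequential (MacQueen-style) k-means seeded with the first k points of the stream, and it tracks the running sum of squared distances as a simple internal index that needs no class labels.

```python
import math
import random


class OnlineKMeans:
    """Sequential k-means: one pass over the stream, memory limited to k centroids."""

    def __init__(self, n_clusters):
        self.n_clusters = n_clusters
        self.centers = []  # centroid coordinates
        self.counts = []   # number of points absorbed by each centroid
        self.ssq = 0.0     # running sum of squared distances (internal index)

    def _nearest(self, x):
        dists = [math.dist(x, c) for c in self.centers]
        j = min(range(len(dists)), key=dists.__getitem__)
        return j, dists[j]

    def predict_one(self, x):
        return self._nearest(x)[0]

    def learn_one(self, x):
        # Assumption: the first k points of the stream seed the centroids.
        if len(self.centers) < self.n_clusters:
            self.centers.append(list(x))
            self.counts.append(1)
            return len(self.centers) - 1
        j, d = self._nearest(x)
        self.ssq += d * d  # score the point before the centroid moves
        self.counts[j] += 1
        step = 1.0 / self.counts[j]
        # Incremental mean update: the centroid stays the mean of its points.
        self.centers[j] = [c + step * (xi - c) for c, xi in zip(self.centers[j], x)]
        return j


def two_blob_stream(n_per_blob=200, seed=0):
    """Synthetic stream alternating between blobs centred at (0, 0) and (10, 10)."""
    rng = random.Random(seed)
    for i in range(2 * n_per_blob):
        mu = 0.0 if i % 2 == 0 else 10.0
        yield (rng.gauss(mu, 1.0), rng.gauss(mu, 1.0))


model = OnlineKMeans(n_clusters=2)
for x in two_blob_stream():
    model.learn_one(x)
```

Each point is processed once and then discarded, which is where the time and memory advantages of the online setting come from. The `ssq` value is computed from distances alone, in contrast to classification metrics, which require ground-truth labels.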

Index Terms

Online Clustering: Algorithms, Evaluation, Metrics, Applications and Benchmarking
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
    2. Learning settings
      1. Online learning settings
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Data stream mining

Published In

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 2022

5033 pages

ISBN:9781450393850

DOI:10.1145/3534678

General Chairs: Aidong Zhang (University of Virginia), Huzefa Rangwala (Amazon/George Mason University)

Copyright © 2022 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Abstract

Conference

KDD '22

KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 14 - 18, 2022

Washington DC, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
