research-article

Clustering Data Streams: Theory and Practice

Authors:

Rajeev Motwani,

Liadan O'Callaghan

IEEE Transactions on Knowledge and Data Engineering, Volume 15, Issue 3

Pages 515 - 528

https://doi.org/10.1109/TKDE.2003.1198387

Published: 01 March 2003

Abstract

The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm's performance on synthetic and real data streams.

Published In

IEEE Transactions on Knowledge and Data Engineering Volume 15, Issue 3

March 2003

255 pages

ISSN:1041-4347

Copyright © 2003 IEEE. All Rights Reserved.

Publisher

IEEE Educational Activities Department

United States

