research-article

Clustering Data Streams: Theory and Practice

Published: 01 March 2003

Abstract

The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm's performance on synthetic and real data streams.
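
To make the single-pass, small-memory constraint concrete, here is a minimal sketch of one way a stream can be clustered without storing it. The abstract does not spell out the algorithm, so this is an illustration under stated assumptions and not the method of the paper: the names `stream_cluster` and `weighted_kmeans`, the `chunk_size` parameter, and the use of a Lloyd-style weighted k-means as the per-chunk clustering step are all hypothetical choices made for the example.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, ...]
Weighted = Tuple[Point, float]  # (location, weight)


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_kmeans(points: List[Weighted], k: int, iters: int = 10) -> List[Weighted]:
    """Lloyd-style weighted k-means (an assumed stand-in for any clustering step).

    Returns at most k (center, total_weight) pairs summarizing `points`.
    """
    if len(points) <= k:
        return list(points)
    dim = len(points[0][0])
    # Deterministic farthest-first initialization.
    centers = [points[0][0]]
    while len(centers) < k:
        far = max(points, key=lambda pw: min(sq_dist(pw[0], c) for c in centers))
        centers.append(far[0])
    weights = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        weights = [0.0] * k
        for p, w in points:
            j = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            weights[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        # Move each non-empty center to the weighted mean of its points.
        centers = [
            tuple(s / weights[j] for s in sums[j]) if weights[j] > 0 else centers[j]
            for j in range(k)
        ]
    return [(centers[j], weights[j]) for j in range(k) if weights[j] > 0]


def stream_cluster(stream: Iterable[Point], k: int, chunk_size: int) -> List[Weighted]:
    """Cluster a stream in one pass, holding fewer than 2 * chunk_size + k items."""
    summary: List[Weighted] = []  # weighted centers retained from earlier chunks
    chunk: List[Weighted] = []    # raw points of the chunk currently being read
    for p in stream:
        chunk.append((tuple(p), 1.0))
        if len(chunk) == chunk_size:
            # Replace the raw chunk by at most k weighted centers.
            summary.extend(weighted_kmeans(chunk, k))
            chunk = []
            if len(summary) >= chunk_size:
                # Compress the summary itself when it grows too large.
                summary = weighted_kmeans(summary, k)
    if chunk:
        summary.extend(weighted_kmeans(chunk, k))
    return weighted_kmeans(summary, k)
```

Each point is read once and discarded after its chunk is summarized, so memory depends on `chunk_size` and `k` rather than on the length of the stream. The weights record how many original points each retained center represents, which lets later clustering rounds treat a center as a stand-in for those points.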

Information & Contributors

Information

Published In

IEEE Transactions on Knowledge and Data Engineering, Volume 15, Issue 3
March 2003
255 pages

Publisher

IEEE Educational Activities Department

United States

Author Tags

  1. Clustering
  2. Approximation algorithms
  3. Data streams

Qualifiers

  • Research-article
