
Finding frequent items in data streams

Published: 26 January 2004

Abstract

We present a 1-pass algorithm for estimating the most frequent items in a data stream using limited storage space. Our method relies on a data structure called a COUNT SKETCH, which allows us to reliably estimate the frequencies of frequent items in the stream. Our algorithm achieves better space bounds than the previously known best algorithms for this problem for several natural distributions on the item frequencies. In addition, our algorithm leads directly to a 2-pass algorithm for the problem of estimating the items with the largest (absolute) change in frequency between two data streams. To our knowledge, this latter problem has not been previously studied in the literature.
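The abstract names the COUNT SKETCH data structure but does not describe its internals. As a rough illustration, the following is a minimal sketch of the construction as it is commonly described: a small table of counters, one bucket hash and one sign hash per row, and a median over the per-row estimates. The class name, the `depth`, `width`, and `seed` parameters, and the modular hash family are assumptions made for this example and are not taken from the text above. The signed `delta` argument is also an assumption; it is included because the structure is linear, which is what would let one sketch the difference between two streams.

```python
import random
import statistics


class CountSketch:
    """Illustrative count sketch: depth x width counters with per-row hashes.

    All names and parameters here are hypothetical choices for this example.
    """

    _PRIME = (1 << 61) - 1  # Mersenne prime used by the modular hash family

    def __init__(self, depth=5, width=256, seed=0):
        rng = random.Random(seed)
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]
        # Each (a, b) pair defines a hash x -> (a * x + b) mod p.
        self._bucket_params = [
            (rng.randrange(1, self._PRIME), rng.randrange(self._PRIME))
            for _ in range(depth)
        ]
        self._sign_params = [
            (rng.randrange(1, self._PRIME), rng.randrange(self._PRIME))
            for _ in range(depth)
        ]

    def _bucket(self, row, key):
        a, b = self._bucket_params[row]
        return ((a * key + b) % self._PRIME) % self.width

    def _sign(self, row, key):
        a, b = self._sign_params[row]
        return 1 if ((a * key + b) % self._PRIME) & 1 else -1

    def update(self, item, delta=1):
        """Add `delta` occurrences of `item` (negative values subtract)."""
        key = hash(item)
        for row in range(self.depth):
            self.table[row][self._bucket(row, key)] += self._sign(row, key) * delta

    def estimate(self, item):
        """Return the median of the per-row signed counter values."""
        key = hash(item)
        return statistics.median(
            self._sign(row, key) * self.table[row][self._bucket(row, key)]
            for row in range(self.depth)
        )


# Usage: one heavy item among 49 items that each appear once.
sketch = CountSketch(depth=5, width=256, seed=1)
for x in [1] * 1000 + list(range(2, 51)):
    sketch.update(x)
print(sketch.estimate(1))  # an estimate near 1000
```

Each row's estimate for an item equals its true count plus signed noise from the other items that land in the same bucket; taking the median across rows limits the influence of any single row with a bad collision. A one-pass frequent-items procedure would typically pair such a sketch with a small set of candidate items ranked by their current estimates, which is omitted here.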


Published In

Theoretical Computer Science, Volume 312, Issue 1
Special issue on automata, languages and programming
26 January 2004
138 pages

Publisher

Elsevier Science Publishers Ltd.

United Kingdom

Author Tags

  1. approximation
  2. frequent items
  3. streaming algorithm

Qualifiers

  • Article
