Beating CountSketch for heavy hitters in insertion streams

Published: 19 June 2016

Abstract

Given a stream $p_1, \ldots, p_m$ of items from a universe $U$, which, without loss of generality, we identify with the set of integers $\{1, 2, \ldots, n\}$, we consider the problem of returning all $\ell_2$-heavy hitters, i.e., those items $j$ for which $f_j \ge \epsilon \sqrt{F_2}$, where $f_j$ is the number of occurrences of item $j$ in the stream, and $F_2 = \sum_{i \in [n]} f_i^2$. Such a guarantee is considerably stronger than the $\ell_1$-guarantee, which finds those $j$ for which $f_j \ge \epsilon m$. In 2002, Charikar, Chen, and Farach-Colton suggested the CountSketch data structure, which finds all such $j$ using $\Theta(\log^2 n)$ bits of space (for constant $\epsilon > 0$). The only known lower bound is $\Omega(\log n)$ bits of space, which comes from the need to specify the identities of the items found.
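To make the two guarantees concrete, here is a minimal sketch that computes both sets exactly from full frequency counts. It uses linear space, so it only illustrates the definitions and is not a streaming algorithm. The function name `heavy_hitters` and the example stream are hypothetical and chosen for illustration.

```python
from collections import Counter
from math import sqrt


def heavy_hitters(stream, eps):
    """Return (l1_heavy, l2_heavy), computed exactly from full counts."""
    freq = Counter(stream)                    # f_j for every item j
    m = len(stream)                           # stream length, sum of all f_j
    f2 = sum(c * c for c in freq.values())    # second moment F_2
    l1_heavy = {j for j, c in freq.items() if c >= eps * m}
    l2_heavy = {j for j, c in freq.items() if c >= eps * sqrt(f2)}
    return l1_heavy, l2_heavy


# Hypothetical stream: item 0 occurs 100 times, items 1..10000 once each.
stream = [0] * 100 + list(range(1, 10001))
l1, l2 = heavy_hitters(stream, eps=0.5)
print(l1, l2)  # prints: set() {0}
```

Here $m = 10100$ and $\sqrt{F_2} = \sqrt{20000} \approx 141.4$, so item 0 clears the $\ell_2$ threshold (about 70.7) but is far below the $\ell_1$ threshold (5050). Since $\sqrt{F_2} \le m$ always holds, every $\ell_1$-heavy hitter is also an $\ell_2$-heavy hitter, but not conversely.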
In this paper we show one can achieve $O(\log n \log\log n)$ bits of space for this problem. Our techniques, based on Gaussian processes, lead to a number of other new results for data streams, including: (1) The first algorithm for estimating $F_2$ simultaneously at all points in a stream using only $O(\log n \log\log n)$ bits of space, improving a natural union bound. (2) A way to estimate the $\ell_\infty$ norm of a stream up to additive error $\epsilon \sqrt{F_2}$ with $O(\log n \log\log n)$ bits of space, resolving Open Question 3 from the IITK 2006 list for insertion only streams.
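For reference, the CountSketch baseline that the paper improves on can be sketched as follows. This is a minimal illustration of the 2002 data structure, not the paper's $O(\log n \log\log n)$ algorithm. The class name, the default depth and width, and the linear hash family modulo a Mersenne prime are assumptions made for brevity.

```python
import random
from statistics import median


class CountSketch:
    """Illustrative CountSketch: depth rows of width signed counters."""

    P = (1 << 61) - 1  # Mersenne prime used as the hash modulus (assumption)

    def __init__(self, depth=5, width=256, seed=0):
        rng = random.Random(seed)
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]
        # One (a, b) pair per row for the bucket hash and for the sign hash.
        self.bucket_ab = [(rng.randrange(1, self.P), rng.randrange(self.P))
                          for _ in range(depth)]
        self.sign_ab = [(rng.randrange(1, self.P), rng.randrange(self.P))
                        for _ in range(depth)]

    def _bucket(self, r, j):
        a, b = self.bucket_ab[r]
        return ((a * j + b) % self.P) % self.width

    def _sign(self, r, j):
        a, b = self.sign_ab[r]
        return 1 - 2 * (((a * j + b) % self.P) & 1)  # +1 or -1

    def update(self, j, delta=1):
        """Process delta occurrences of item j."""
        for r in range(self.depth):
            self.table[r][self._bucket(r, j)] += self._sign(r, j) * delta

    def estimate(self, j):
        """Median over rows of the signed counter that j hashes to."""
        return median([self._sign(r, j) * self.table[r][self._bucket(r, j)]
                       for r in range(self.depth)])


# Hypothetical stream: item 42 occurs 1000 times, 20 other items once each.
sketch = CountSketch()
for item in [42] * 1000 + list(range(100, 120)):
    sketch.update(item)
print(sketch.estimate(42))  # within 20 of 1000: only 20 other unit items exist
```

With width on the order of $1/\epsilon^2$ and depth on the order of $\log n$, each counter holding $O(\log n)$ bits, this layout accounts for the $\Theta(\log^2 n)$ bits quoted above for constant $\epsilon$.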

References

[1]
Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.
[2]
Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487–499, 1994.
[3]
Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci., 58(1):137–147, 1999.
[4]
Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Streaming algorithms via precision sampling. In FOCS, pages 363–372, 2011.
[5]
Khanh Do Ba, Piotr Indyk, Eric Price, and David P. Woodruff. Lower bounds for sparse recovery. CoRR, abs/1106.0365, 2011.
[6]
Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues in data stream systems. In PODS, pages 1–16, 2002.
[7]
Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. J. Comput. Syst. Sci., 68(4):702–732, 2004.
[8]
Kevin S. Beyer and Raghu Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. In SIGMOD, pages 359–370, 1999.
[9]
Lakshminath Bhuvanagiri, Sumit Ganguly, Deepanjan Kesh, and Chandan Saha. Simpler algorithm for estimating frequency moments of data streams. In SODA, pages 708–713, 2006.
[10]
Jean Bourgain and Jelani Nelson. Toward a unified theory of sparse dimensionality reduction in Euclidean space. CoRR, abs/1311.2542, 2013.
[11]
Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, Jelani Nelson, Zhengyu Wang, and David P. Woodruff. BPTree: an ℓ2 heavy hitters algorithm using constant memory. arXiv preprint arXiv:1603.00759, 2016.
[12]
Vladimir Braverman, Jonathan Katzman, Charles Seidell, and Gregory Vorsanger. An optimal algorithm for large frequency moments using O(n^{1−2/k}) bits. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 531–544, 2014.
[13]
Vladimir Braverman and Rafail Ostrovsky. Approximating large frequency moments with pick-and-drop sampling. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 42–57. Springer, 2013.
[14]
Yousra Chabchoub, Christine Fricker, and Hanene Mohamed. Analysis of a bloom filter algorithm via the supermarket model. In 21st International Teletraffic Congress, pages 1–8, 2009.
[15]
Amit Chakrabarti, Subhash Khot, and Xiaodong Sun. Near-optimal lower bounds on the multi-party communication complexity of set disjointness. In Conference on Computational Complexity, pages 107–117, 2003.
[16]
Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. Theor. Comput. Sci., 312(1):3–15, 2004.
[17]
Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In STOC, pages 81–90, 2013.
[18]
Graham Cormode and S. Muthukrishnan. Data stream methods. http://www.cs.rutgers.edu/~muthu/198-3.pdf/, 2003. Lecture 3 of Rutgers Seminar on Processing Massive Data Sets.
[19]
Graham Cormode and S. Muthukrishnan. An improved data stream summary: the Count-Min sketch and its applications. J. Algorithms, 55(1):58–75, 2005.
[20]
Erik D. Demaine, Alejandro López-Ortiz, and J. Ian Munro. Frequency estimation of internet packet streams with limited space. In ESA, pages 348–360, 2002.
[21]
Cristian Estan and George Varghese. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Trans. Comput. Syst., 21(3):270–313, 2003.
[22]
Min Fang, Narayanan Shivakumar, Hector Garcia-Molina, Rajeev Motwani, and Jeffrey D. Ullman. Computing iceberg queries efficiently. In VLDB, pages 299–310, 1998.
[23]
Anna C. Gilbert, Yi Li, Ely Porat, and Martin J. Strauss. Approximate sparse recovery: optimizing time and measurements. In STOC, pages 475–484, 2010.
[24]
Jiawei Han, Jian Pei, Guozhu Dong, and Ke Wang. Efficient computation of iceberg cubes with complex measures. In SIGMOD, pages 1–12, 2001.
[25]
Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. In SIGMOD, pages 1–12, 2000.
[26]
Christian Hidber. Online association rule mining. In SIGMOD, pages 145–156, 1999.
[27]
Zengfeng Huang, Wai Ming Tai, and Ke Yi. Tracking the frequency moments at all times. arXiv preprint arXiv:1412.1763, 2014.
[28]
Piotr Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. J. ACM, 53(3):307–323, 2006.
[29]
Piotr Indyk and David P. Woodruff. Optimal approximations of the frequency moments of data streams. In STOC, pages 202–208, 2005.
[30]
T. S. Jayram and David P. Woodruff. The data stream space complexity of cascaded norms. In FOCS, pages 765–774, 2009.
[31]
Hossein Jowhari, Mert Saglam, and Gábor Tardos. Tight bounds for lp samplers, finding duplicates in streams, and related problems. In PODS, pages 49–58, 2011.
[32]
Daniel Kane, Raghu Meka, and Jelani Nelson. Almost optimal explicit Johnson-Lindenstrauss families. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 628–639. Springer, 2011.
[33]
Richard M. Karp, Scott Shenker, and Christos H. Papadimitriou. A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst., 28:51–55, 2003.
[34]
Abhishek Kumar and Jun (Jim) Xu. Sketch guided sampling - using on-line estimates of flow size for adaptive data collection. In INFOCOM, 2006.
[35]
Michel Ledoux and Michel Talagrand. Probability in Banach Spaces, volume 23. Springer-Verlag, 1991.
[36]
Yi Li and David P Woodruff. A tight lower bound for high frequency moment estimation with small error. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 623–638. Springer, 2013.
[37]
Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. PVLDB, 5(12):1699, 2012.
[38]
Raghu Meka. A PTAS for computing the supremum of Gaussian processes. In FOCS, pages 217–222. IEEE, 2012.
[39]
Xiangrui Meng and Michael W. Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In STOC, pages 91–100, 2013.
[40]
Gregory T. Minton and Eric Price. Improved concentration bounds for Count-Sketch. In SODA, pages 669–686, 2014.
[41]
Jayadev Misra and David Gries. Finding repeated elements. Sci. Comput. Program., 2(2):143–152, 1982.
[42]
Morteza Monemizadeh and David P. Woodruff. 1-pass relative-error lp-sampling with applications. In SODA, pages 1143–1160, 2010.
[43]
S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2), 2005.
[44]
Jelani Nelson and Huy L. Nguyen. OSNAP: faster numerical linear algebra algorithms via sparser subspace embeddings. In FOCS, pages 117–126, 2013.
[45]
Noam Nisan. Pseudorandom generators for space-bounded computation. Combinatorica, 12(4):449–461, 1992.
[46]
Eric Price. Efficient sketches for the set query problem. In SODA, pages 41–56, 2011.
[47]
Ashok Savasere, Edward Omiecinski, and Shamkant B. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB, pages 432–444, 1995.
[48]
Michel Talagrand. Majorizing measures: The generic chaining. The Annals of Probability, 24(3), 1996.
[49]
Michel Talagrand. Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems, volume 60. Springer Science & Business Media, 2014.
[50]
Mikkel Thorup and Yin Zhang. Tabulation-based 5-independent hashing with applications to linear probing and second moment estimation. SIAM J. Comput., 41(2):293–331, 2012.
[51]
Hannu Toivonen. Sampling large databases for association rules. In VLDB, pages 134–145, 1996.

      Published In

      STOC '16: Proceedings of the forty-eighth annual ACM symposium on Theory of Computing
      June 2016
      1141 pages
      ISBN:9781450341325
      DOI:10.1145/2897518

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Chaining
      2. Data Streams
      3. Heavy Hitters

      Qualifiers

      • Research-article

      Conference

      STOC '16: Symposium on Theory of Computing
      June 19 - 21, 2016
      Cambridge, MA, USA

      Acceptance Rates

      Overall Acceptance Rate 1,469 of 4,586 submissions, 32%

      Cited By

      • (2024) Local Differentially Private Heavy Hitter Detection in Data Streams with Bounded Memory. Proceedings of the ACM on Management of Data, 2:1, pages 1-27. DOI: 10.1145/3639285. Online publication date: 26-Mar-2024
      • (2023) Pseudorandom Hashing for Space-bounded Computation with Applications in Streaming. 2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS), pages 1515-1550. DOI: 10.1109/FOCS57990.2023.00093. Online publication date: 6-Nov-2023
      • (2023) Streaming Euclidean k-median and k-means with o(log n) Space. 2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS), pages 883-908. DOI: 10.1109/FOCS57990.2023.00057. Online publication date: 6-Nov-2023
      • (2022) Memory bounds for the experts problem. Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pages 1158-1171. DOI: 10.1145/3519935.3520069. Online publication date: 9-Jun-2022
      • (2022) Tight Bounds for Adversarially Robust Streams and Sliding Windows via Difference Estimators. 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 1183-1196. DOI: 10.1109/FOCS52979.2021.00116. Online publication date: Feb-2022
      • (2022) Three-wise independent random walks can be slightly unbounded. Random Structures & Algorithms, 61:3, pages 573-598. DOI: 10.1002/rsa.21075. Online publication date: 3-Jan-2022
      • (2021) Linear and kernel classification in the streaming model. Proceedings of the 35th International Conference on Neural Information Processing Systems, pages 14407-14420. DOI: 10.5555/3540261.3541365. Online publication date: 6-Dec-2021
      • (2021) A simple proof of a new set disjointness with applications to data streams. Proceedings of the 36th Computational Complexity Conference. DOI: 10.4230/LIPIcs.CCC.2021.37. Online publication date: 20-Jul-2021
      • (2021) Timely Reporting of Heavy Hitters Using External Memory. ACM Transactions on Database Systems, 46:4, pages 1-35. DOI: 10.1145/3472392. Online publication date: 15-Nov-2021
      • (2020) Timely Reporting of Heavy Hitters using External Memory. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 1431-1446. DOI: 10.1145/3318464.3380598. Online publication date: 11-Jun-2020
