Heavy hitters via cluster-preserving clustering

Published: 24 July 2019

Abstract

We develop a new algorithm for the heavy hitters problem in general turnstile streams, the EXPANDERSKETCH, which finds the approximate top-k items in a universe of size n using the same asymptotic O(k log n) words of memory and O(log n) update time as the COUNTMIN and COUNTSKETCH, but requiring only O(k poly(log n)) time to answer queries instead of the O(n log n) time of the other two. The notion of "approximation" is the same ℓ2 sense as the COUNTSKETCH, which, given known lower bounds, is the strongest guarantee one can achieve in sublinear memory.
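To make the query-time comparison concrete, here is a minimal sketch of the COUNTSKETCH baseline, not of the EXPANDERSKETCH itself. The class name, the method names, the default sizes, and the use of Python's salted built-in `hash` in place of a proper limited-independence hash family are all assumptions made for illustration. In the usual parameterization the number of rows grows like log n and the width grows with k, which the defaults below do not attempt to tune.

```python
import random
import statistics


class CountSketch:
    """Toy CountSketch over integer items (illustrative, not tuned)."""

    def __init__(self, rows=5, width=64, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.table = [[0] * width for _ in range(rows)]
        # One (bucket salt, sign salt) pair per row. Salted built-in hashing is an
        # assumption standing in for the hash families used in the analysis.
        self.salts = [(rng.getrandbits(64), rng.getrandbits(64)) for _ in range(rows)]

    def _bucket(self, row, item):
        return hash((self.salts[row][0], item)) % self.width

    def _sign(self, row, item):
        return 1 if hash((self.salts[row][1], item)) % 2 else -1

    def update(self, item, delta):
        """Turnstile update: delta may be positive or negative."""
        for row, counters in enumerate(self.table):
            counters[self._bucket(row, item)] += self._sign(row, item) * delta

    def estimate(self, item):
        """Median over rows of the signed counter the item hashes to."""
        return statistics.median(
            self._sign(row, item) * counters[self._bucket(row, item)]
            for row, counters in enumerate(self.table)
        )

    def top_k_by_scan(self, n, k):
        """Slow query path: estimate every item in range(n), keep the k largest."""
        scored = [(item, self.estimate(item)) for item in range(n)]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


sketch = CountSketch()
for delta in (5, 3, -2):  # insertions and a deletion of the same item
    sketch.update(7, delta)
print(sketch.estimate(7))
```

The point of `top_k_by_scan` is that it touches every item in the universe, which is the slow query path the abstract contrasts with; updates only touch one counter per row.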
Our main innovation is an efficient reduction from the heavy hitters problem to a clustering problem in which each heavy hitter is encoded as some form of noisy spectral cluster in a graph, and the goal is to identify every cluster. Since every heavy hitter must be found, correctness requires that every cluster be found. We thus need a "cluster-preserving clustering" algorithm that partitions the graph into pieces while finding every cluster. To do this we first apply standard spectral graph partitioning, and then we use some novel local search techniques to modify the cuts obtained so as to make sure that the original clusters are sufficiently preserved. Our clustering algorithm may be of broader interest beyond heavy hitters and streaming algorithms.
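The "standard spectral graph partitioning" step can be illustrated with a small, self-contained sketch. It shows only a textbook approach (power iteration for the second eigenvector of the lazy normalized adjacency matrix, followed by a sweep cut that minimizes conductance); it is not the cluster-preserving algorithm and does not include the local search step. The function names, the iteration count, and the example graph are assumptions chosen for illustration, and the input is assumed to be a connected undirected graph without self-loops.

```python
import math


def spectral_sweep_cut(adj, iters=200):
    """adj maps each vertex to the set of its neighbours (undirected, no self-loops)."""
    nodes = sorted(adj)
    deg = {u: len(adj[u]) for u in nodes}
    total_vol = sum(deg.values())
    # The top eigenvector of the normalized adjacency matrix is proportional to sqrt(deg).
    top = {u: math.sqrt(deg[u] / total_vol) for u in nodes}
    # Arbitrary deterministic start; assumed not to be parallel to `top`.
    x = {u: float(i) for i, u in enumerate(nodes)}

    for _ in range(iters):
        # Project out the top eigenvector, then apply (I + D^-1/2 A D^-1/2) / 2.
        dot = sum(x[u] * top[u] for u in nodes)
        x = {u: x[u] - dot * top[u] for u in nodes}
        y = {
            u: 0.5 * (x[u] + sum(x[v] / math.sqrt(deg[u] * deg[v]) for v in adj[u]))
            for u in nodes
        }
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {u: y[u] / norm for u in nodes}

    # Sweep cut: order vertices by the degree-scaled embedding and keep the
    # prefix with the smallest conductance cut / min(vol(S), vol(V \ S)).
    order = sorted(nodes, key=lambda u: x[u] / math.sqrt(deg[u]))
    best_set, best_phi = None, float("inf")
    inside, vol, cut = set(), 0, 0
    for u in order[:-1]:
        inside.add(u)
        vol += deg[u]
        in_nbrs = sum(1 for v in adj[u] if v in inside)
        cut += deg[u] - 2 * in_nbrs  # edges to `inside` leave the cut, the rest enter
        phi = cut / min(vol, total_vol - vol)
        if phi < best_phi:
            best_set, best_phi = set(inside), phi
    return best_set, best_phi


def two_cliques_with_bridge(size=4):
    """Hypothetical test graph: two cliques joined by a single edge."""
    adj = {u: set() for u in range(2 * size)}
    for block in (range(size), range(size, 2 * size)):
        for u in block:
            for v in block:
                if u != v:
                    adj[u].add(v)
    adj[size - 1].add(size)
    adj[size].add(size - 1)
    return adj


side, phi = spectral_sweep_cut(two_cliques_with_bridge())
print(sorted(side), phi)
```

On this toy graph the sweep separates the two cliques. A plain spectral cut like this carries no guarantee that every cluster survives intact in general, which is the gap the abstract's local search modification is meant to address.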

Published In

Communications of the ACM, Volume 62, Issue 8
August 2019
88 pages
ISSN:0001-0782
EISSN:1557-7317
DOI:10.1145/3351434
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed
