Heavy hitters via cluster-preserving clustering

Authors:

Kasper Green Larsen,

Huy L. Nguyễn,

Mikkel Thorup

Communications of the ACM, Volume 62, Issue 8

Pages 95 - 100

https://doi.org/10.1145/3339185

Published: 24 July 2019

Abstract

We develop a new algorithm for the heavy hitters problem in general turnstile streams, the EXPANDERSKETCH, which finds the approximate top-k items in a universe of size n. It uses the same asymptotic O(k log n) words of memory and O(log n) update time as the COUNTMIN and COUNTSKETCH, but requires only O(k poly(log n)) time to answer queries instead of the O(n log n) time of the other two. The notion of "approximation" is the same ℓ₂ sense as in the COUNTSKETCH, which, given known lower bounds, is the strongest guarantee one can achieve in sublinear memory.
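
To make the query-time contrast concrete, the following is a minimal Python sketch of a textbook COUNTSKETCH together with the naive query that estimates every item in the universe. It is not the EXPANDERSKETCH. The class and method names, the parameters `width`, `depth`, and `seed`, and the use of fully random lookup tables in place of limited-independence hash functions are assumptions made for illustration only.

```python
import random
import statistics


class CountSketch:
    """Toy CountSketch over the universe {0, ..., n-1} (turnstile updates)."""

    def __init__(self, n, width, depth, seed=0):
        rng = random.Random(seed)
        self.n = n
        self.width = width
        self.depth = depth
        # Assumption: fully random tables stand in for the limited-independence
        # hash functions of the analysis. They cost O(n) space per row, which a
        # real implementation would avoid.
        self.bucket = [[rng.randrange(width) for _ in range(n)] for _ in range(depth)]
        self.sign = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def update(self, item, delta):
        """Apply x[item] += delta; delta may be negative (turnstile model)."""
        for r in range(self.depth):
            self.table[r][self.bucket[r][item]] += self.sign[r][item] * delta

    def estimate(self, item):
        """Point query: median over rows of the signed counter."""
        return statistics.median(
            self.sign[r][item] * self.table[r][self.bucket[r][item]]
            for r in range(self.depth)
        )

    def top_k_slow(self, k):
        """Naive query: estimate every universe item, so time grows with n."""
        ranked = sorted(range(self.n), key=lambda i: abs(self.estimate(i)), reverse=True)
        return ranked[:k]
```

The point of the sketch is `top_k_slow`: it touches all n items and takes a median over `depth` rows for each, which is the linear-in-n query cost that the abstract says the EXPANDERSKETCH avoids.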

Our main innovation is an efficient reduction from the heavy hitters problem to a clustering problem in which each heavy hitter is encoded as some form of noisy spectral cluster in a graph, and the goal is to identify every cluster. Since every heavy hitter must be found, correctness requires that every cluster be found. We thus need a "cluster-preserving clustering" algorithm that partitions the graph into pieces while finding every cluster. To do this we first apply standard spectral graph partitioning, and then we use some novel local search techniques to modify the cuts obtained so as to make sure that the original clusters are sufficiently preserved. Our clustering algorithm may be of broader interest beyond heavy hitters and streaming algorithms.
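
As a rough illustration of the "standard spectral graph partitioning" step only, the sketch below computes a sweep cut from the second eigenvector of the normalized adjacency matrix using plain power iteration. It is a generic textbook procedure, not the cluster-preserving algorithm of the paper, and it omits the local search that modifies the cuts. The function names, the adjacency-dictionary input format, the fixed start vector, and the iteration count are all assumptions.

```python
import math


def conductance(adj, subset):
    """Cut edges leaving `subset` divided by the smaller side's volume."""
    cut = sum(1 for u in subset for v in adj[u] if v not in subset)
    vol_in = sum(len(adj[u]) for u in subset)
    vol_out = sum(len(adj[u]) for u in adj) - vol_in
    return cut / min(vol_in, vol_out)


def spectral_sweep_cut(adj, iterations=300):
    """Return (subset, conductance) from a sweep over the second eigenvector.

    `adj` maps each vertex to a set of neighbours. The graph is assumed to be
    undirected and connected, with at least two vertices.
    """
    nodes = sorted(adj)
    deg = {u: len(adj[u]) for u in nodes}
    # Top eigenvector of D^{-1/2} A D^{-1/2} is proportional to sqrt(degree).
    norm = math.sqrt(sum(deg.values()))
    top = {u: math.sqrt(deg[u]) / norm for u in nodes}
    # Deterministic start vector (an arbitrary choice for this sketch; it is
    # assumed not to be parallel to `top`).
    x = {u: float(i + 1) for i, u in enumerate(nodes)}
    for _ in range(iterations):
        # Project out the top eigenvector, then normalise.
        dot = sum(x[u] * top[u] for u in nodes)
        x = {u: x[u] - dot * top[u] for u in nodes}
        length = math.sqrt(sum(val * val for val in x.values()))
        x = {u: val / length for u, val in x.items()}
        # One step of the lazy matrix (I + D^{-1/2} A D^{-1/2}) / 2, whose
        # eigenvalues are non-negative, so power iteration finds the second one.
        x = {
            u: 0.5 * x[u]
            + 0.5 * sum(x[v] / math.sqrt(deg[u] * deg[v]) for v in adj[u])
            for u in nodes
        }
    # Sweep: order vertices by the degree-scaled embedding, test each prefix.
    order = sorted(nodes, key=lambda u: x[u] / math.sqrt(deg[u]))
    best_set, best_phi = None, float("inf")
    for size in range(1, len(order)):
        prefix = set(order[:size])
        phi = conductance(adj, prefix)
        if phi < best_phi:
            best_set, best_phi = prefix, phi
    return best_set, best_phi
```

On a graph made of two dense groups joined by a few edges, the sweep tends to return one of the groups. A cut chosen this way can still slice through a true cluster, which is the failure mode the cluster-preserving modification described above is meant to prevent.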

Index Terms

Heavy hitters via cluster-preserving clustering
1. Theory of computation
  1. Design and analysis of algorithms
    1. Algorithm design techniques
    2. Approximation algorithms analysis
      1. Facility location and clustering

Information

Published In

Communications of the ACM Volume 62, Issue 8

August 2019

88 pages

ISSN:0001-0782

EISSN:1557-7317

DOI:10.1145/3351434

Editor:
Andrew A. Chien
Association for Computing Machinery, New York, NY

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed

Funding Sources

ONR
NSF
