DOI: 10.1145/3131365.3131407
Research article

A high-performance algorithm for identifying frequent items in data streams

Published: 01 November 2017

Abstract

Estimating frequencies of items over data streams is a common building block in streaming data measurement and analysis. Misra and Gries introduced their seminal algorithm for the problem in 1982, and the problem has since been revisited many times due to its practicality and applicability. We describe a highly optimized version of Misra and Gries' algorithm that is suitable for deployment in industrial settings. Our code is made public via an open source library called Data Sketches that is already used by several companies and production systems.
Our algorithm improves on two theoretical and practical aspects of prior work. First, it handles weighted updates in amortized constant time, a common requirement in practice. Second, it uses a simple and fast method for merging summaries that asymptotically improves on prior work even for unweighted streams. We describe experiments confirming that our algorithms are more efficient than prior proposals.
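For orientation, the following is a minimal sketch of a textbook Misra-Gries style summary with weighted updates and merging. It is not the optimized algorithm described in the paper and not the Data Sketches API: the class name `MisraGriesSketch`, its methods, and the parameter `k` are hypothetical names chosen for illustration. This version sorts the counters whenever the summary overflows, so it does not achieve the amortized constant update time claimed above. The reduction step assumed here subtracts the (k+1)-th largest counter from every counter, a rule commonly used for weighted updates and for merging.

```python
class MisraGriesSketch:
    """Illustrative Misra-Gries summary holding at most k counters.

    Hypothetical teaching code: the names and structure are assumptions,
    not taken from the paper or from any library.
    """

    def __init__(self, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self.counters = {}

    def update(self, item, weight=1):
        """Add a positive weight for item, then shrink if over capacity."""
        if weight <= 0:
            raise ValueError("weight must be positive")
        self.counters[item] = self.counters.get(item, 0) + weight
        if len(self.counters) > self.k:
            self._reduce()

    def merge(self, other):
        """Fold another summary into this one by adding counters, then shrinking."""
        for item, count in other.counters.items():
            self.counters[item] = self.counters.get(item, 0) + count
        if len(self.counters) > self.k:
            self._reduce()

    def estimate(self, item):
        """Return a lower bound on the total weight seen for item."""
        return self.counters.get(item, 0)

    def _reduce(self):
        # Subtract the (k+1)-th largest count from all counters and keep
        # only those that stay positive, leaving at most k counters.
        ordered = sorted(self.counters.values(), reverse=True)
        cut = ordered[self.k]
        self.counters = {x: c - cut for x, c in self.counters.items() if c > cut}


first = MisraGriesSketch(k=2)
first.update("a", 5)
first.update("b", 3)
first.update("c", 1)      # overflow: every counter is reduced by 1

second = MisraGriesSketch(k=2)
second.update("d", 4)
second.update("a", 2)

first.merge(second)       # combined counters are reduced by 2
print(first.estimate("a"), first.estimate("d"), first.estimate("b"))
```

Each reduction by an amount `cut` removes at least `(k + 1) * cut` units of weight from the summary, so every estimate undercounts the true weight by at most the total stream weight divided by `k + 1`.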

Published In

IMC '17: Proceedings of the 2017 Internet Measurement Conference
November 2017
509 pages
ISBN:9781450351188
DOI:10.1145/3131365
Sponsors

In-Cooperation

  • USENIX Assoc: USENIX Assoc

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. frequent items
  2. mergeable summaries
  3. streaming algorithms

Qualifiers

  • Research-article

Conference

IMC '17
IMC '17: Internet Measurement Conference
November 1 - 3, 2017
London, United Kingdom

Acceptance Rates

Overall Acceptance Rate 277 of 1,083 submissions, 26%
