Stingy sketch: a sketch framework for accurate and fast frequency estimation

Published: 01 March 2022

Abstract

Recording item frequencies in highly skewed data streams is a fundamental problem that has attracted much attention in recent years. The literature indicates that sketches are the most promising solution. Sketches are typically measured by accuracy and speed, but existing sketches only trade one off against the other. We propose a new sketch framework called Stingy sketch, built on two key techniques, Bit-pinching Counter Tree (BCTree) and Prophet Queue (PQueue), which together optimize both accuracy and speed. The key idea of BCTree is to split a large fixed-size counter into many small nodes of a tree structure and to use a precise encoding to perform carry-in operations with low processing overhead. The key idea of PQueue is to use a pipelined prefetch technique so that most memory accesses happen in the L2 cache without losing precision. Importantly, the two techniques are cooperative, so Stingy sketch can improve accuracy and speed simultaneously. Extensive experimental results show that Stingy sketch is up to 50% more accurate than the state of the art (SOTA) among accuracy-oriented sketches and up to 33% faster than the SOTA among speed-oriented sketches.
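
To make the idea of small counters that carry into a tree more concrete, the following is a minimal Python sketch under loud assumptions. It is not the BCTree encoding from the paper, which this page does not describe in detail. The class name `CarryCounterArray`, the parameters `num_leaves` and `leaf_bits`, the two-level layout with one parent shared by each pair of adjacent leaves, and the BLAKE2b-based hash are all hypothetical choices made for illustration only.

```python
import hashlib


class CarryCounterArray:
    """Toy two-level counter: small leaves that carry into shared parents.

    Illustrative assumption only; NOT the paper's BCTree encoding.
    """

    def __init__(self, num_leaves=1024, leaf_bits=4):
        self.num_leaves = num_leaves
        self.leaf_cap = 1 << leaf_bits  # a leaf wraps to 0 at this value
        self.leaves = [0] * num_leaves
        # Each pair of adjacent leaves shares one parent counter.
        self.parents = [0] * ((num_leaves + 1) // 2)

    def _index(self, item):
        # Deterministic hash of the item to a leaf position.
        digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.num_leaves

    def insert(self, item):
        i = self._index(item)
        self.leaves[i] += 1
        if self.leaves[i] == self.leaf_cap:
            # Overflow: reset the leaf and carry one unit into the parent.
            self.leaves[i] = 0
            self.parents[i // 2] += 1

    def query(self, item):
        i = self._index(item)
        # The parent also holds carries from the sibling leaf, so the
        # estimate can overshoot the true count but never undershoots it.
        return self.leaves[i] + self.parents[i // 2] * self.leaf_cap


counters = CarryCounterArray(num_leaves=8, leaf_bits=4)
for _ in range(20):
    counters.insert("flow-a")
print(counters.query("flow-a"))
```

With a single item inserted 20 times there are no collisions, so the query is exact: the 4-bit leaf holds 4 and the parent holds one carry worth 16. In a skewed stream most items are infrequent and would stay in their small leaves, which is the intuition behind spending few bits per counter and carrying only when needed.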

Cited By

  • (2024) Local Differentially Private Heavy Hitter Detection in Data Streams with Bounded Memory. Proceedings of the ACM on Management of Data 2:1 (1-27). DOI: 10.1145/3639285. Online publication date: 26-Mar-2024.
  • (2024) SuperGuardian. Information Systems 122:C. DOI: 10.1016/j.is.2024.102351. Online publication date: 2-Jul-2024.
  • (2023) ChainedFilter: Combining Membership Filters by Chain Rule. Proceedings of the ACM on Management of Data 1:4 (1-27). DOI: 10.1145/3626721. Online publication date: 12-Dec-2023.

Published In

Proceedings of the VLDB Endowment  Volume 15, Issue 7
March 2022
208 pages
ISSN:2150-8097

Publisher

VLDB Endowment

Publication History

Published: 01 March 2022
Published in PVLDB Volume 15, Issue 7

Qualifiers

  • Research-article

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 120
  • Downloads (Last 6 weeks): 21
Reflects downloads up to 09 Nov 2024

Citations

  • (2023) BitSense: Universal and Nearly Zero-Error Optimization for Sketch Counters with Compressive Sensing. Proceedings of the ACM SIGCOMM 2023 Conference (220-238). DOI: 10.1145/3603269.3604865. Online publication date: 10-Sep-2023.
  • (2023) JoinSketch: A Sketch Algorithm for Accurate and Unbiased Inner-Product Estimation. Proceedings of the ACM on Management of Data 1:1 (1-26). DOI: 10.1145/3588935. Online publication date: 30-May-2023.
  • (2023) SketchPolymer: Estimate Per-item Tail Quantile Using One Sketch. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (590-601). DOI: 10.1145/3580305.3599505. Online publication date: 6-Aug-2023.
  • (2023) MimoSketch: A Framework to Mine Item Frequency on Multiple Nodes with Sketches. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2838-2849). DOI: 10.1145/3580305.3599433. Online publication date: 6-Aug-2023.
  • (2023) MicroscopeSketch: Accurate Sliding Estimation Using Adaptive Zooming. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2660-2671). DOI: 10.1145/3580305.3599432. Online publication date: 6-Aug-2023.
  • (2023) Learning-Based Dichotomy Graph Sketch for Summarizing Graph Streams with High Accuracy. Knowledge Science, Engineering and Management (47-59). DOI: 10.1007/978-3-031-40286-9_5. Online publication date: 16-Aug-2023.
