research-article

SuperGuardian: Superspreader removal for cardinality estimation in data streaming

Authors:

Hongchang Chen,

Quan Ren

Volume 122, Issue C

https://doi.org/10.1016/j.is.2024.102351

Published: 01 May 2024

Abstract

Measuring flow cardinality is a fundamental problem in data stream mining, where a data stream is modeled as a sequence of items from different flows and the cardinality of a flow is the number of distinct items in that flow. Many sketches based on estimator sharing have been proposed to handle huge flows in data streams. However, these sketches use memory inefficiently because they allocate the same amount of memory to every estimator, ignoring the skewed cardinality distribution. To address this issue, we propose SuperGuardian, which improves the memory efficiency of existing sketches. SuperGuardian separates high-cardinality flows from the data stream and keeps their information in the large estimator, while existing sketches with small estimators record the low-cardinality flows. We analyze the cardinality estimation error of SuperGuardian mathematically. To validate our proposal, we implemented SuperGuardian and evaluated it experimentally on real traffic traces. The results show that existing sketches using SuperGuardian reduce error by 79%–96% and increase throughput by 0.3–2.3 times.
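The memory problem described in the abstract can be illustrated with a minimal sketch. It is not the SuperGuardian algorithm, whose separation mechanism the abstract does not describe. It only shows, under stated assumptions, why one estimator size fits a skewed cardinality distribution poorly. The estimator is a plain linear-counting bitmap. The function names, the flow data, and the sizes of 64 and 16384 bits are hypothetical choices made for illustration.

```python
import hashlib
import math


def bit_index(item: str, m: int) -> int:
    """Map an item to one of m bit positions with a deterministic hash."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % m


def estimate_flow(items, m: int) -> float:
    """Estimate the number of distinct items with an m-bit linear-counting bitmap."""
    bits = bytearray(m)  # one byte per bit, for readability rather than compactness
    for item in items:
        bits[bit_index(item, m)] = 1  # a repeated item always sets the same bit
    zeros = bits.count(0)
    if zeros == 0:
        # Saturated bitmap: the estimate is undefined, so this sketch
        # returns m * ln(m) as a cap (an assumption of this example).
        return m * math.log(m)
    return -m * math.log(zeros / m)


# Hypothetical flows (illustrative data, not taken from the paper's traces).
small_flow = [f"dst-{i}" for i in range(20)]
large_flow = [f"dst-{i}" for i in range(5000)]

print("20 distinct items, 64-bit estimator:     ", round(estimate_flow(small_flow, 64), 1))
print("5000 distinct items, 64-bit estimator:   ", round(estimate_flow(large_flow, 64), 1))
print("5000 distinct items, 16384-bit estimator:", round(estimate_flow(large_flow, 16384), 1))
```

In this sketch the small estimator is adequate for the low-cardinality flow but saturates on the high-cardinality one, while giving every flow the large estimator would spend most of the memory on flows that do not need it. This is the trade-off that motivates treating high-cardinality flows separately.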

Published In

Information Systems Volume 122, Issue C

May 2024

192 pages

Elsevier Ltd.

Publisher

Elsevier Science Ltd.

United Kingdom
