research-article

SuperGuardian: Superspreader removal for cardinality estimation in data streaming

Published: 01 May 2024

Abstract

Measuring flow cardinality is a fundamental problem in data stream mining, where a data stream is modeled as a sequence of items from different flows and the cardinality of a flow is the number of distinct items in that flow. Many existing sketches based on estimator sharing have been proposed to deal with huge flows in data streams. However, these sketches use memory inefficiently because they allocate the same amount of memory to every estimator, ignoring the skewed cardinality distribution. To address this issue, we propose SuperGuardian, which improves the memory efficiency of existing sketches. SuperGuardian separates high-cardinality flows from the data stream and keeps their information in a large estimator, while existing sketches with small estimators record the low-cardinality flows. We mathematically analyze the cardinality estimation error of SuperGuardian. To validate our proposal, we implemented SuperGuardian and evaluated it experimentally on real traffic traces. The results show that existing sketches using SuperGuardian reduce error by 79%–96% and increase throughput by 0.3–2.3 times.
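The memory argument in the abstract can be illustrated with a small sketch. It is not SuperGuardian itself: the abstract does not describe how flows are separated online, so the superspreader is labeled by hand here, every flow gets its own estimator instead of sharing one, and a classical linear-counting bitmap stands in for the estimators. The workload, the names `flows`, `uniform_bits`, and `split_bits`, and the bit sizes are all assumptions chosen for illustration.

```python
import hashlib
import math


def bit_index(item: str, m: int) -> int:
    """Map an item to one of m bit positions using a stable hash."""
    digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % m


def linear_count(items, m: int) -> float:
    """Estimate the number of distinct items with an m-bit bitmap."""
    bits = [False] * m
    for item in items:
        bits[bit_index(item, m)] = True
    zeros = bits.count(False)
    if zeros == 0:
        # Saturated bitmap: cap at m * ln(m), the estimate for one zero bit.
        return m * math.log(m)
    return -m * math.log(zeros / m)


def relative_error(items, m: int) -> float:
    truth = len(set(items))  # exact cardinality: number of distinct items
    return abs(linear_count(items, m) - truth) / truth


# Hypothetical skewed workload: one superspreader and 99 small flows.
# The superspreader sees 3000 items but only 1000 distinct ones.
flows = {"super": [f"dst-{i % 1000}" for i in range(3000)]}
for f in range(99):
    flows[f"flow-{f}"] = [f"dst-{f}-{i}" for i in range(10)]

# Two ways to spend the same 6400-bit budget (sizes are assumptions).
uniform_bits = {name: 64 for name in flows}
split_bits = {name: 856 if name == "super" else 56 for name in flows}

uniform_err = relative_error(flows["super"], uniform_bits["super"])
split_err = relative_error(flows["super"], split_bits["super"])

print(f"budget: {sum(uniform_bits.values())} vs {sum(split_bits.values())} bits")
print(f"superspreader error, uniform allocation: {uniform_err:.2f}")
print(f"superspreader error, split allocation:   {split_err:.2f}")
```

Under the uniform allocation, the 64-bit bitmap of the superspreader is almost certainly saturated, so its estimate is capped near 64 ln 64 ≈ 266 regardless of the true cardinality. Moving a few bits from each small flow to one large estimator lets the superspreader be estimated far more closely, at a small cost in accuracy for the low-cardinality flows.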


Published In

Information Systems  Volume 122, Issue C
May 2024
192 pages

Publisher

Elsevier Science Ltd.

United Kingdom

Author Tags

  1. Data stream mining
  2. Cardinality estimation
  3. Superspreader
  4. Sketch

Qualifiers

  • Research-article
