DOI: 10.1145/3589334.3645581
research-article
Open access

Stable-Sketch: A Versatile Sketch for Accurate, Fast, Web-Scale Data Stream Processing

Published: 13 May 2024

Abstract

Data stream processing plays a pivotal role in various web-related applications, including click fraud detection, anomaly identification, and recommendation systems. However, accurately and quickly detecting the items relevant to such tasks within data streams (e.g., heavy hitters, heavy changers, and persistent items) is non-trivial. This is due to growing streaming speeds, the limited fast memory (L1 cache) available in current systems, and the highly skewed item distributions encountered in practice. As a result, items of interest that are tracked based only on their features (e.g., item frequency or persistence value) are susceptible to replacement by non-relevant ones, leading to modest detection accuracy, as we reveal. In this work, we introduce the notion of bucket stability, which quantifies the degree of recorded item variation, and show that it is a powerful metric for identifying distinct item types. We propose Stable-Sketch, an elegant and versatile sketch that exploits multidimensional information, including item statistics and bucket stability, and adopts a stochastic approach to drive replacement decisions. We present a theoretical analysis of the error bounds of Stable-Sketch and conduct extensive experiments to demonstrate that our solution achieves substantially higher accuracy and faster processing speeds than state-of-the-art sketches in a range of item detection tasks, even under tight memory budgets. We further enhance Stable-Sketch's update throughput with Single Instruction Multiple Data (SIMD) instructions and implement our solution in P4, demonstrating real-world deployment viability.
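
To make the idea of stability-driven stochastic replacement more concrete, the following is a minimal Python sketch written under loud assumptions. It is not the authors' algorithm: the class name `ToyStableSketch`, the bucket layout, the rule for updating `stability`, and the replacement probability `1 / (count * stability + 1)` are all hypothetical choices made for illustration. The abstract only states that replacement decisions combine item statistics with bucket stability in a stochastic manner.

```python
import random
from dataclasses import dataclass
from typing import Hashable, Optional


@dataclass
class Bucket:
    key: Optional[Hashable] = None  # item currently recorded in the bucket
    count: int = 0                  # item statistic (here: frequency)
    stability: int = 0              # ASSUMPTION: grows while the same key is seen


class ToyStableSketch:
    """Illustrative only; the update rules below are assumptions."""

    def __init__(self, rows: int = 2, width: int = 64, seed: int = 0):
        self.rows = rows
        self.width = width
        self.rng = random.Random(seed)
        # One salt per row so that each row hashes keys differently.
        self.salts = [self.rng.getrandbits(32) for _ in range(rows)]
        self.table = [[Bucket() for _ in range(width)] for _ in range(rows)]

    def _index(self, row: int, key: Hashable) -> int:
        return hash((self.salts[row], key)) % self.width

    def update(self, key: Hashable) -> None:
        for r in range(self.rows):
            b = self.table[r][self._index(r, key)]
            if b.key is None:
                # Empty bucket: record the new item.
                b.key, b.count, b.stability = key, 1, 1
            elif b.key == key:
                # Same item again: both the statistic and the stability grow.
                b.count += 1
                b.stability += 1
            else:
                # ASSUMPTION: a bucket holding a frequent item that has been
                # stable for a long time is replaced with low probability.
                p = 1.0 / (b.count * b.stability + 1)
                if self.rng.random() < p:
                    b.key, b.count, b.stability = key, 1, 1

    def query(self, key: Hashable) -> int:
        # Largest count over the rows that currently record this key.
        counts = [
            self.table[r][self._index(r, key)].count
            for r in range(self.rows)
            if self.table[r][self._index(r, key)].key == key
        ]
        return max(counts, default=0)


if __name__ == "__main__":
    sketch = ToyStableSketch(rows=1, width=1)
    for _ in range(5):
        sketch.update("a")
    print(sketch.query("a"))  # 5: "a" is the only item seen so far
```

In this toy version, an item that has held its bucket for a long time with a high count is very unlikely to be evicted by a single colliding item, which is the intuition the abstract attributes to tracking stability in addition to raw item statistics.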

Cited By

  • (2025) TailoredSketch: A Fast and Adaptive Sketch for Efficient Per-Flow Size Measurement. IEEE Transactions on Network Science and Engineering 12:1 (505-517). DOI: 10.1109/TNSE.2024.3503904. Online publication date: Jan-2025
  • (2024) A Universal Sketch for Estimating Heavy Hitters and Per-Element Frequency Moments in Data Streams with Bounded Deletions. Proceedings of the ACM on Management of Data 2:6 (1-28). DOI: 10.1145/3698799. Online publication date: 20-Dec-2024

    Published In

    WWW '24: Proceedings of the ACM Web Conference 2024
    May 2024
    4826 pages
    ISBN:9798400701719
    DOI:10.1145/3589334
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. bucket stability
    2. data stream
    3. heavy items
    4. persistent items
    5. sketch

    Qualifiers

    • Research-article

    Funding Sources

    • Cisco University Research Program Fund

    Conference

    WWW '24
    WWW '24: The ACM Web Conference 2024
    May 13 - 17, 2024
    Singapore, Singapore

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

    Article Metrics

    • Downloads (Last 12 months): 675
    • Downloads (Last 6 weeks): 124
    Reflects downloads up to 26 Dec 2024
