research-article

Open access

SQUID: Faster Analytics via Sampled Quantile Estimation

Authors:

Bilal Tayh

Proceedings of the ACM on Networking, Volume 2, Issue CoNEXT3

Article No.: 19, Pages 1 - 23

https://doi.org/10.1145/3676873

Published: 21 August 2024

Abstract

Streaming algorithms are fundamental in the analysis of large and online datasets. A key component of many such analytic tasks is q-MAX, which finds the largest q values in a number stream. Modern approaches attain a constant runtime by removing small items in bulk and retaining the largest q items at all times. Yet, these approaches are bottlenecked by an expensive quantile calculation. This work introduces a quantile-sampling approach called SQUID and shows its benefits in multiple analytic tasks. Using this approach, we design a novel weighted heavy hitters data structure that is faster and more accurate than the existing alternatives. We also show SQUID's practicality for improving network-assisted caching systems with a hardware-based cache prototype that uses SQUID to implement the cache policy. The challenge here is that the switch's dataplane does not allow the general computation required to implement many cache policies, while its CPU is orders of magnitude slower. We overcome this issue by passing just SQUID's samples to the CPU, thus bridging this gap. In software implementations, we show that our method is up to 6.6x faster than the state-of-the-art alternatives when using real workloads. For switch-based caching, SQUID enables a wide spectrum of data-plane-based caching policies and achieves higher hit ratios than the state-of-the-art P4LRU.
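The q-MAX pattern described in the abstract (buffer incoming values, prune small ones in bulk, and use a sample rather than an exact quantile to pick the pruning threshold) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `SampledQMax`, the parameters `gamma` and `sample_size`, the pivot heuristic, and the fallback to exact selection are all assumptions made for this example so that the result stays exact.

```python
import heapq
import random


class SampledQMax:
    """Keep the q largest values of a stream with bulk, sample-guided pruning.

    Illustrative sketch only: `gamma`, `sample_size`, and the pivot heuristic
    are assumptions for this example, not parameters taken from the paper.
    """

    def __init__(self, q, gamma=1.0, sample_size=32, seed=None):
        self.q = q
        self.capacity = int(q * (1 + gamma)) + 1  # slack beyond q before pruning
        self.sample_size = sample_size
        self.threshold = float("-inf")            # admission filter
        self.buf = []
        self.rng = random.Random(seed)

    def add(self, x):
        # At least q buffered values are >= threshold, so anything strictly
        # smaller can never be among the q largest.
        if x < self.threshold:
            return
        self.buf.append(x)
        if len(self.buf) >= self.capacity:
            self._prune()

    def _prune(self):
        # Estimate a pivot from a small random sample instead of running an
        # exact selection over the whole buffer.
        k = min(self.sample_size, len(self.buf))
        sample = sorted(self.rng.sample(self.buf, k))
        drop_frac = 1 - self.q / len(self.buf)
        pivot = sample[int(k * drop_frac * 0.5)]  # deliberately conservative
        survivors = [v for v in self.buf if v >= pivot]
        # Assumption of this sketch: if the sampled pivot would lose a top-q
        # value, or would drop nothing, fall back to exact selection.
        if len(survivors) < self.q or len(survivors) == len(self.buf):
            survivors = heapq.nlargest(self.q, self.buf)
            pivot = survivors[-1]
        self.buf = survivors
        self.threshold = pivot

    def largest(self):
        """Return the q largest values seen so far, in descending order."""
        return heapq.nlargest(self.q, self.buf)
```

A caller would feed values with `add` and read the result with `largest()`. Because most prunes only sort a fixed-size sample and make one linear pass over the buffer, the costly exact selection is left to the rare fallback path, which is the bottleneck the abstract refers to.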

Index Terms

Networks

Published In

Proceedings of the ACM on Networking Volume 2, Issue CoNEXT3

PACMNET

September 2024

108 pages

EISSN: 2834-5509

DOI: 10.1145/3689614

Editors: Marco Mellia (Politecnico di Torino, Italy) and Peter Steenkiste (Carnegie Mellon University, United States)

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States