research-article
Open access

SQUID: Faster Analytics via Sampled Quantile Estimation

Published: 21 August 2024

Abstract

Streaming algorithms are fundamental in the analysis of large and online datasets. A key component of many such analytic tasks is q-MAX, which finds the largest q values in a number stream. Modern approaches attain a constant runtime by removing small items in bulk and retaining the largest q items at all times. Yet, these approaches are bottlenecked by an expensive quantile calculation. This work introduces a quantile-sampling approach called SQUID and shows its benefits in multiple analytic tasks. Using this approach, we design a novel weighted heavy hitters data structure that is faster and more accurate than the existing alternatives. We also show SQUID's practicality for improving network-assisted caching systems with a hardware-based cache prototype that uses SQUID to implement the cache policy. The challenge here is that the switch's data plane does not allow the general computation required to implement many cache policies, while its CPU is orders of magnitude slower. We overcome this issue by passing just SQUID's samples to the CPU, thus bridging this gap. In software implementations, we show that our method is up to 6.6× faster than the state-of-the-art alternatives when using real workloads. For switch-based caching, SQUID enables a wide spectrum of data-plane-based caching policies and achieves higher hit ratios than the state-of-the-art P4LRU.
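
To make the idea in the abstract concrete, the following is a minimal Python sketch of a q-MAX buffer that evicts small items in bulk using a pivot estimated from a random sample rather than an exact quantile computation. It is an illustration under stated assumptions, not the authors' implementation: the class name `SampledQMax` and the parameters `gamma`, `sample_size`, and `slack` are hypothetical, and the retry-then-sort fallback is one simple way to keep the result exact when a sampled pivot turns out to be unusable.

```python
import random


class SampledQMax:
    """Track the q largest values of a stream (illustrative sketch).

    Items accumulate in a buffer of about q*(1+gamma) slots. When the
    buffer fills, a pivot is estimated from a small random sample and
    every item below the pivot is dropped in one pass.
    """

    def __init__(self, q, gamma=1.0, sample_size=64, slack=8, seed=None):
        self.q = q
        # Capacity must exceed q so that each compaction removes something.
        self.capacity = max(q + 1, int(q * (1 + gamma)))
        self.sample_size = sample_size  # hypothetical tuning knob
        self.slack = slack              # bias the pivot low, to be safe
        self.buf = []
        self.threshold = float("-inf")  # items below this cannot be top-q
        self.rng = random.Random(seed)

    def add(self, x):
        # At least q buffered items are >= threshold, so x is not needed.
        if x < self.threshold:
            return
        self.buf.append(x)
        if len(self.buf) >= self.capacity:
            self._compact()

    def _compact(self):
        n = len(self.buf)
        k = min(self.sample_size, n)
        for _ in range(3):  # a few sampling attempts before giving up
            sample = sorted(self.rng.sample(self.buf, k), reverse=True)
            # Aim for roughly q survivors, shifted down by `slack` ranks.
            idx = min(k - 1, k * self.q // n + self.slack)
            pivot = sample[idx]
            survivors = [v for v in self.buf if v >= pivot]
            # Accept only if the top q are kept and the buffer shrinks.
            if self.q <= len(survivors) < n:
                self.buf, self.threshold = survivors, pivot
                return
        # Fallback: exact selection by sorting (rare, but always correct).
        self.buf = sorted(self.buf, reverse=True)[: self.q]
        self.threshold = self.buf[-1]

    def largest(self):
        """Return the q largest values seen so far, in descending order."""
        return sorted(self.buf, reverse=True)[: self.q]
```

Because every sampled pivot is verified against the buffer before it is used, the sketch always returns the exact top q; sampling only affects how much work each compaction does. A typical use would be to call `add` for each stream item and `largest` whenever the current top q values are needed.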

Published In

Proceedings of the ACM on Networking  Volume 2, Issue CoNEXT3
PACMNET
September 2024
108 pages
EISSN:2834-5509
DOI:10.1145/3689614
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. heavy hitters
  2. in-network caching
  3. quantiles
  4. sampling
  5. streaming
