research-article
Open access

Towards Accelerating Data Intensive Application's Shuffle Process Using SmartNICs

Published: 22 May 2023

Abstract

The wide adoption of emerging SmartNIC technology creates new opportunities to offload application-level computation into the networking layer, which relieves host CPUs and improves performance. Shuffle, the all-to-all data exchange process, is a critical building block for network communication in distributed data-intensive applications and can potentially benefit from SmartNICs.
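To make the all-to-all pattern concrete, the following is a minimal single-process sketch of a hash-partitioned shuffle. It is an illustration only and is not taken from SmartShuffle or Spark: the function names `partition` and `shuffle`, the use of CRC32 as the partitioning hash, and the in-memory "inboxes" standing in for network transfers are all assumptions.

```python
from collections import defaultdict
from zlib import crc32


def partition(records, num_receivers):
    """Split (key, value) records into one bucket per receiver by key hash."""
    buckets = defaultdict(list)
    for key, value in records:
        # crc32 is stable across runs, unlike the built-in hash() for strings.
        dest = crc32(str(key).encode("utf-8")) % num_receivers
        buckets[dest].append((key, value))
    return buckets


def shuffle(sender_outputs, num_receivers):
    """All-to-all exchange: each sender may contribute data to every receiver."""
    inboxes = [[] for _ in range(num_receivers)]
    for records in sender_outputs:
        for dest, bucket in partition(records, num_receivers).items():
            # In a real system this append would be a network transfer.
            inboxes[dest].extend(bucket)
    return inboxes


if __name__ == "__main__":
    senders = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
    for receiver_id, inbox in enumerate(shuffle(senders, 2)):
        print(receiver_id, inbox)
```

The property that matters is that every record with a given key ends up at the same receiver, whichever sender produced it. The per-record hashing and bucketing in `partition` is the kind of work the abstract refers to as data partitioning.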
In this paper, we develop SmartShuffle, which accelerates the shuffle process of data-intensive applications by offloading computation tasks to SmartNIC devices. SmartShuffle supports offloading both low-level network functions, including data partitioning and network transport, and high-level computation tasks, including filtering, aggregation, and sorting. SmartShuffle adopts a coordinated offload architecture in which sender-side and receiver-side SmartNICs jointly contribute to the benefits of shuffle computation offload. SmartShuffle carefully manages the tight and time-varying computation and memory constraints on the device. We propose a liquid offloading approach, which dynamically migrates operators between the host CPU and the SmartNIC at runtime so that resources on both devices are fully utilized.
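The abstract does not describe the migration policy itself, so the following is only a toy sketch of the general idea of re-placing operators as the SmartNIC's available capacity changes. Everything in it is an assumption made for illustration: the `Operator` type, the `nic_cost` units, the greedy prefix policy in `place`, and the example budgets. It should not be read as SmartShuffle's algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    name: str
    nic_cost: float  # hypothetical units of SmartNIC compute the operator needs


def place(operators, nic_budget):
    """Greedily keep a prefix of the pipeline on the NIC while it fits.

    Returns (on_nic, on_host) as lists of operator names.
    """
    on_nic, on_host = [], []
    remaining = nic_budget
    spilled = False
    for op in operators:
        if not spilled and op.nic_cost <= remaining:
            on_nic.append(op.name)
            remaining -= op.nic_cost
        else:
            # Once one operator falls back to the host, later ones follow,
            # so the pipeline crosses the NIC/host boundary only once.
            spilled = True
            on_host.append(op.name)
    return on_nic, on_host


if __name__ == "__main__":
    pipeline = [
        Operator("partition", 1.0),
        Operator("filter", 1.0),
        Operator("aggregate", 2.0),
        Operator("sort", 3.0),
    ]
    # Re-run placement whenever the measured NIC budget changes.
    for budget in (7.0, 3.0, 0.5):
        print(budget, place(pipeline, budget))
```

Calling `place` again with a smaller budget moves the tail of the pipeline back to the host, and a larger budget moves it onto the NIC, which is the runtime migration the abstract alludes to in its simplest form.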
We prototype SmartShuffle on Stingray SoC SmartNICs and plug it into Spark. Our evaluation shows that SmartShuffle improves host CPU efficiency and I/O efficiency and lowers job completion time. SmartShuffle outperforms Spark and Spark RDMA by up to 40% on TPC-H.




    Published In

    Proceedings of the ACM on Measurement and Analysis of Computing Systems  Volume 7, Issue 2
    POMACS
    June 2023
    247 pages
    EISSN:2476-1249
    DOI:10.1145/3599176
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 May 2023
    Published in POMACS Volume 7, Issue 2

    Author Tags

    1. data analytics
    2. hardware offloading
    3. smartnic

    Qualifiers

    • Research-article

    Funding Sources

    • NSF CNS


    Cited By

    • (2024) An Integrated Solution for High-efficiency In-band Network Telemetry. Proceedings of the 8th Asia-Pacific Workshop on Networking, pages 115-121. DOI: 10.1145/3663408.3663425. Online publication date: 3-Aug-2024
    • (2024) D2Comp: Efficient Offload of LSM-tree Compaction with Data Processing Units on Disaggregated Storage. ACM Transactions on Architecture and Code Optimization, 21(3):1-22. DOI: 10.1145/3656584. Online publication date: 9-Apr-2024
    • (2024) Offloading NVMe over Fabrics (NVMe-oF) to SmartNICs on an at-scale Distributed Testbed. 2024 IEEE 10th International Conference on Network Softwarization (NetSoft), pages 316-318. DOI: 10.1109/NetSoft60951.2024.10588915. Online publication date: 24-Jun-2024
    • (2024) Using hierarchical information-theoretic criteria to optimize subsampling of extensive datasets. Chemometrics and Intelligent Laboratory Systems, 245:105067. DOI: 10.1016/j.chemolab.2024.105067. Online publication date: Feb-2024
    • (2024) Policy advice and best practices on bias and fairness in AI. Ethics and Information Technology, 26(2). DOI: 10.1007/s10676-024-09746-w. Online publication date: 29-Apr-2024
    • (2023) Yama. Proceedings of the 2023 ACM Symposium on Cloud Computing, pages 572-587. DOI: 10.1145/3620678.3624792. Online publication date: 30-Oct-2023
    • (2023) DComp: Efficient Offload of LSM-tree Compaction with Data Processing Units. Proceedings of the 52nd International Conference on Parallel Processing, pages 233-243. DOI: 10.1145/3605573.3605633. Online publication date: 7-Aug-2023
