research-article

Open access

Towards Accelerating Data Intensive Application's Shuffle Process Using SmartNICs

Authors:

Aditya Akella

Proceedings of the ACM on Measurement and Analysis of Computing Systems, Volume 7, Issue 2

Article No.: 36, Pages 1 - 23

https://doi.org/10.1145/3589980

Published: 22 May 2023

Abstract

The wide adoption of emerging SmartNIC technology creates new opportunities to offload application-level computation into the networking layer, which relieves the burden on host CPUs and improves performance. Shuffle, the all-to-all data exchange process, is a critical building block for network communication in distributed data-intensive applications and can potentially benefit from SmartNICs.
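To make the all-to-all pattern concrete, the following is a minimal, host-only sketch of a hash-partitioned shuffle. It is an illustration, not code from the paper: the function names `partition` and `shuffle`, the use of Python's built-in `hash` as the partitioner, and the in-memory "inboxes" standing in for network transfers are all assumptions.

```python
from collections import defaultdict

def partition(records, num_receivers):
    """Sender side: bucket (key, value) records by destination receiver.

    Assumption: the destination is chosen as hash(key) % num_receivers.
    """
    buckets = defaultdict(list)
    for key, value in records:
        buckets[hash(key) % num_receivers].append((key, value))
    return buckets

def shuffle(sender_outputs, num_receivers):
    """All-to-all exchange: each sender may ship a bucket to every receiver.

    sender_outputs holds one list of (key, value) records per sender.
    The returned list holds one inbox per receiver; appending to an inbox
    stands in for a network transfer.
    """
    inboxes = [[] for _ in range(num_receivers)]
    for records in sender_outputs:
        for dest, bucket in partition(records, num_receivers).items():
            inboxes[dest].extend(bucket)
    return inboxes

if __name__ == "__main__":
    senders = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]
    for receiver, inbox in enumerate(shuffle(senders, 2)):
        print(receiver, inbox)
```

Because every record with a given key lands in the same inbox, a receiver can aggregate or sort its keys without further communication.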

In this paper, we develop SmartShuffle, which accelerates the shuffle process of data-intensive applications by offloading various computation tasks to SmartNIC devices. SmartShuffle supports offloading both low-level network functions, including data partitioning and network transport, and high-level computation tasks, including filtering, aggregation, and sorting. SmartShuffle adopts a coordinated offload architecture in which sender-side and receiver-side SmartNICs jointly contribute to the benefits of shuffle computation offload. SmartShuffle carefully manages the tight and time-varying computation and memory constraints on the device. We propose a liquid offloading approach, which dynamically migrates operators between the host CPU and the SmartNIC at runtime so that resources on both devices are fully utilized.
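The idea of migrating operators as device resources change can be sketched as a re-planning step that runs whenever the SmartNIC's free capacity changes. The abstract does not describe the actual migration policy, so the greedy rule below is a stand-in, and the per-operator costs, the budget units, and the names `place_operators`, `replan`, and `PIPELINE` are hypothetical.

```python
# Operator names come from the abstract; the costs are made-up units.
PIPELINE = [("partition", 1), ("filter", 2), ("aggregate", 3), ("sort", 5)]

def place_operators(operators, nic_budget):
    """Greedy placement (an assumed policy, not the paper's).

    Walk the operators in pipeline order and offload each one that still
    fits in the SmartNIC budget; everything else stays on the host CPU.
    Returns (on_nic, on_host) lists of operator names.
    """
    on_nic, on_host = [], []
    remaining = nic_budget
    for name, cost in operators:
        if cost <= remaining:
            on_nic.append(name)
            remaining -= cost
        else:
            on_host.append(name)
    return on_nic, on_host

def replan(budget_trace):
    """Recompute the placement for each observed SmartNIC budget."""
    return [place_operators(PIPELINE, budget) for budget in budget_trace]

if __name__ == "__main__":
    # The budget shrinks over time, so operators move back to the host.
    for on_nic, on_host in replan([6, 2, 0]):
        print("NIC:", on_nic, "| host:", on_host)
```

A real system would also have to account for the cost of moving operator state between devices, which this sketch ignores.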

We prototype SmartShuffle on Stingray SoC SmartNICs and plug it into Spark. Our evaluation shows that SmartShuffle improves host CPU efficiency and I/O efficiency and lowers job completion time. SmartShuffle outperforms Spark and Spark RDMA by up to 40% on TPC-H.

Index Terms

Towards Accelerating Data Intensive Application's Shuffle Process Using SmartNICs
1. Networks
  1. Network services
    1. In-network processing


Published In

Proceedings of the ACM on Measurement and Analysis of Computing Systems Volume 7, Issue 2

POMACS

June 2023

247 pages

EISSN: 2476-1249

DOI: 10.1145/3599176

Editors: Augustin Chaintreau (Columbia University), Leana Golubchik (University of Southern California, United States), Zhi-Li Zhang (University of Minnesota, United States)

Copyright © 2023 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States


Funding Sources

NSF CNS

Article Metrics

Total Citations: 7
Total Downloads: 1,956
Downloads (Last 12 months): 888
Downloads (Last 6 weeks): 86

Reflects downloads up to 03 Feb 2025

Cited By

Xiong X, Xie Y, Chen X, Zheng S, Huang W, Feng J (2024). An Integrated Solution for High-efficiency In-band Network Telemetry. Proceedings of the 8th Asia-Pacific Workshop on Networking, pages 115-121. Online publication date: 3-Aug-2024. https://dl.acm.org/doi/10.1145/3663408.3663425

Ding C, Zhou J, Lu K, Li S, Xiong Y, Wan J, Zhan L (2024). D2Comp: Efficient Offload of LSM-tree Compaction with Data Processing Units on Disaggregated Storage. ACM Transactions on Architecture and Code Optimization, 21(3):1-22. Online publication date: 9-Apr-2024. https://dl.acm.org/doi/10.1145/3656584

Basu S, Nadig D (2024). Offloading NVMe over Fabrics (NVMe-oF) to SmartNICs on an at-scale Distributed Testbed. 2024 IEEE 10th International Conference on Network Softwarization (NetSoft), pages 316-318. Online publication date: 24-Jun-2024. https://doi.org/10.1109/NetSoft60951.2024.10588915

Duarte B, Atkinson A, Oliveira N (2024). Using hierarchical information-theoretic criteria to optimize subsampling of extensive datasets. Chemometrics and Intelligent Laboratory Systems, 245:105067. Online publication date: Feb-2024. https://doi.org/10.1016/j.chemolab.2024.105067

Alvarez J, Colmenarejo A, Elobaid A, Fabbrizzi S, Fahimi M, Ferrara A, Ghodsi S, Mougan C, Papageorgiou I, Reyero P, Russo M, Scott K, State L, Zhao X, Ruggieri S (2024). Policy advice and best practices on bias and fairness in AI. Ethics and Information Technology, 26(2). Online publication date: 29-Apr-2024. https://dl.acm.org/doi/10.1007/s10676-024-09746-w

Ji T, Saxena D, Stephens B, Akella A (2023). Yama. Proceedings of the 2023 ACM Symposium on Cloud Computing, pages 572-587. Online publication date: 30-Oct-2023. https://dl.acm.org/doi/10.1145/3620678.3624792

Ding C, Zhou J, Wan J, Xiong Y, Li S, Chen S, Liu H, Tang L, Zhan L, Lu K, Xu P (2023). DComp: Efficient Offload of LSM-tree Compaction with Data Processing Units. Proceedings of the 52nd International Conference on Parallel Processing, pages 233-243. Online publication date: 7-Aug-2023. https://dl.acm.org/doi/10.1145/3605573.3605633
