research-article

Scalable RDMA RPC on Reliable Connection with Efficient Resource Sharing

Authors:

Jiwu Shu

EuroSys '19: Proceedings of the Fourteenth EuroSys Conference 2019

Article No.: 19, Pages 1 - 14

https://doi.org/10.1145/3302424.3303968

Published: 25 March 2019

Abstract

RDMA provides extremely low latency and high bandwidth to distributed systems. Unfortunately, it fails to scale and suffers from performance degradation when transferring data to an increasing number of targets on Reliable Connection (RC). We observe that this scalability issue is rooted in resource contention in the NIC cache, the CPU cache, and the memory of each server. In this paper, we propose ScaleRPC, an efficient RPC primitive that uses one-sided RDMA verbs on RC to provide scalable performance. To alleviate the resource contention, ScaleRPC introduces 1) connection grouping, which organizes the network connections into groups so as to balance the saturation and thrashing of the NIC cache; and 2) virtualized mapping, which enables a single message pool to be shared by different groups of connections, reducing CPU cache misses and improving memory utilization. Such scalable connection management provides substantial performance benefits: by deploying ScaleRPC in both a distributed file system and a distributed transactional system, we observe that it achieves high scalability and improves performance by up to 90% for metadata accessing and up to 160% for SmallBank transaction processing.
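
The two mechanisms named in the abstract can be pictured with a toy model. The Python sketch below is illustrative only and is not ScaleRPC's implementation: it assumes fixed-size groups, round-robin time slices, and a slot index equal to a connection's position within its group. All names (`group_connections`, `SharedMessagePool`, `schedule`) are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


def group_connections(conn_ids, group_size):
    """Split connection ids into consecutive groups of at most group_size.

    Assumption: a fixed group size; the abstract does not say how groups
    are sized or rebalanced.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return [conn_ids[i:i + group_size]
            for i in range(0, len(conn_ids), group_size)]


@dataclass
class SharedMessagePool:
    """One physical pool of message slots reused by whichever group is active."""
    group_size: int
    slots: list = field(init=False)

    def __post_init__(self):
        # Only group_size slots exist, no matter how many connections there are.
        self.slots = [None] * self.group_size

    def slot_for(self, groups, group_idx, conn_id):
        # Hypothetical virtual-to-physical mapping: a connection's slot is its
        # position inside its group, so every group maps onto the same slots.
        return groups[group_idx].index(conn_id)


def schedule(groups, rounds):
    """Serve one group per time slice in round-robin order (an assumption)."""
    for t in range(rounds):
        yield t, groups[t % len(groups)]
```

In this model, ten connections with a group size of four form three groups, yet the pool holds only four slots: connection 2 in the first group and connection 6 in the second both map to slot 2, which is the sense in which one message pool is shared across groups.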

Published In

EuroSys '19: Proceedings of the Fourteenth EuroSys Conference 2019

March 2019

714 pages

ISBN: 9781450362818

DOI: 10.1145/3302424

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

National Key Research & Development Program of China
Huawei Innovation Research Program
National Natural Science Foundation of China

Conference

EuroSys '19

Sponsor:

SIGOPS

EuroSys '19: Fourteenth EuroSys Conference 2019

March 25 - 28, 2019

Dresden, Germany

Acceptance Rates

Overall Acceptance Rate 241 of 1,308 submissions, 18%
