DOI: 10.1145/3617232.3624857
research-article
Open access

Scaling Up Memory Disaggregated Applications with SMART

Published: 17 April 2024

Abstract

Recent developments in RDMA networks are driving a trend toward memory disaggregation. However, the performance of each compute node is still limited by the network, especially when the node must perform a large number of concurrent fine-grained remote accesses. According to our evaluations, existing IOPS-bound disaggregated applications do not scale well beyond 32 cores and therefore do not take full advantage of today's many-core machines.
After an in-depth analysis of the internal architecture of the RNIC, we found three major scale-up bottlenecks that limit the throughput of today's disaggregated applications: (1) implicit contention on doorbell registers, (2) cache thrashing caused by excessive outstanding work requests, and (3) IOPS wasted on unsuccessful CAS retries. However, the solutions to these problems involve many low-level details that are unfamiliar to application developers. To ease the burden on developers, we propose Smart, an RDMA programming framework that hides these details behind an interface similar to one-sided RDMA verbs.
Refactoring the state-of-the-art disaggregated hash table (RACE) and persistent transaction processing system (FORD) with Smart takes 44 and 16 lines of code and improves their throughput by up to 132.4× and 5.2×, respectively. We have also refactored Sherman (a recent disaggregated B+Tree) with Smart and an additional speculative lookup optimization (48 lines of code changed), which changes its memory access pattern from bandwidth-bound to IOPS-bound and leads to a speedup of 2.0×. Smart is publicly available at https://github.com/madsys-dev/smart.
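To make the third bottleneck concrete, the following toy simulation counts how many CAS operations are issued when several clients update one shared word, with and without randomized exponential backoff. It is only an illustrative sketch: the abstract does not say how Smart handles CAS retries, so backoff is used here as a generic, well-known contention-control technique, and the function `simulate`, its parameters, and the slot-based timing model are all assumptions made for this example.

```python
import random


def simulate(num_clients, ops_per_client, use_backoff, max_exp=6, seed=0):
    """Toy model of CAS contention; returns (cas_issued, cas_succeeded).

    Time advances in slots. In each slot every active client reads the
    shared word and issues one CAS based on that read, so only the first
    CAS to arrive can succeed. All names and the model are hypothetical.
    """
    rng = random.Random(seed)
    remaining = [ops_per_client] * num_clients  # successes still needed
    wait = [0] * num_clients                    # slots left to sit out
    failures = [0] * num_clients                # consecutive failed CAS
    value = 0                                   # the shared remote word
    issued = 0
    succeeded = 0
    while any(remaining):
        snapshot = value  # value every active client observed this slot
        active = []
        for c in range(num_clients):
            if remaining[c] == 0:
                continue
            if wait[c] > 0:
                wait[c] -= 1  # still backing off, issues nothing
                continue
            active.append(c)
        rng.shuffle(active)  # arrival order at the memory node
        for c in active:
            issued += 1  # every CAS costs one remote operation
            if value == snapshot:
                value = snapshot + 1  # first arrival wins
                succeeded += 1
                remaining[c] -= 1
                failures[c] = 0
            else:
                failures[c] += 1  # this CAS was wasted
                if use_backoff:
                    exp = min(failures[c], max_exp)
                    wait[c] = rng.randrange(2 ** exp)
    return issued, succeeded


if __name__ == "__main__":
    for use_backoff in (False, True):
        issued, ok = simulate(32, 100, use_backoff)
        print(f"backoff={use_backoff}: issued={issued}, "
              f"succeeded={ok}, wasted={issued - ok}")
```

In this model, every failed CAS still consumes one remote operation, so the gap between issued and succeeded operations is the wasted IOPS. The absolute numbers depend entirely on the assumed timing model and say nothing about real RNIC behavior.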

Received 2023-04-20; accepted 2023-07-29

Cited By

  • (2024) Aceso: Achieving Efficient Fault Tolerance in Memory-Disaggregated Key-Value Stores. Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 127-143. DOI: 10.1145/3694715.3695951. Online publication date: 4-Nov-2024

Published In

ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1
April 2024
494 pages
ISBN: 9798400703720
DOI: 10.1145/3617232
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. disaggregated memory
  2. one-sided RDMA
  3. scale-up

Qualifiers

  • Research-article

Funding Sources

  • National Key Research & Development Program of China
  • Natural Science Foundation of China
  • Young Elite Scientists Sponsorship Program by CAST

Conference

ASPLOS '24

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Article Metrics

  • Downloads (last 12 months): 1,287
  • Downloads (last 6 weeks): 220

Reflects downloads up to 13 Jan 2025
