
Octopus+: An RDMA-Enabled Distributed Persistent Memory File System

Published: 16 August 2021

Abstract

Non-volatile memory and remote direct memory access (RDMA) provide extremely high performance in storage and network hardware. However, existing distributed file systems strictly isolate the file system and network layers, and their heavily layered software designs leave this high-speed hardware under-exploited.
In this article, we propose Octopus+, an RDMA-enabled distributed persistent memory file system that redesigns file system internal mechanisms by closely coupling non-volatile memory and RDMA features. For data operations, Octopus+ directly accesses a shared persistent memory pool to reduce memory copying overhead, and has clients actively fetch and push data themselves to rebalance the load between the server and the network. For metadata operations, Octopus+ introduces self-identified remote procedure calls for immediate notification between the file system and the networking layer, and an efficient distributed transaction mechanism for consistency. Octopus+ also supports replication to provide better availability. Evaluations on Intel Optane DC Persistent Memory Modules show that Octopus+ achieves nearly the raw bandwidth for large I/Os and performs orders of magnitude better than existing distributed file systems.
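
As a rough illustration of what a "self-identified" remote procedure call could look like, the sketch below simulates a sender packing its own identity into a 32-bit value that travels with a write, so the receiver can go straight to the right message slot instead of scanning all buffers. RDMA write-with-immediate does carry 32 bits of immediate data, but everything else here is an assumption made for illustration and is not taken from the article: the 16/16 bit split, the slot layout, and the names `pack_imm`, `unpack_imm`, and `Server`. The network is replaced by an in-process `bytearray`.

```python
from typing import Tuple

# Hypothetical layout of the 32-bit immediate value: high 16 bits hold the
# sender's node id, low 16 bits hold the message slot index.
NODE_BITS = 16
SLOT_BITS = 16


def pack_imm(node_id: int, slot: int) -> int:
    """Encode (node_id, slot) into one 32-bit immediate value."""
    if not 0 <= node_id < (1 << NODE_BITS):
        raise ValueError("node_id out of range")
    if not 0 <= slot < (1 << SLOT_BITS):
        raise ValueError("slot out of range")
    return (node_id << SLOT_BITS) | slot


def unpack_imm(imm: int) -> Tuple[int, int]:
    """Decode a 32-bit immediate value back into (node_id, slot)."""
    return imm >> SLOT_BITS, imm & ((1 << SLOT_BITS) - 1)


class Server:
    """Simulated server with one fixed-size message slot per (node, slot)."""

    def __init__(self, nodes: int, slots: int, slot_size: int) -> None:
        self.slots = slots
        self.slot_size = slot_size
        self.pool = bytearray(nodes * slots * slot_size)

    def _offset(self, node_id: int, slot: int) -> int:
        return (node_id * self.slots + slot) * self.slot_size

    def remote_write(self, node_id: int, slot: int, payload: bytes) -> int:
        """Stand-in for a one-sided write; returns the immediate value the
        server would see in its completion event."""
        if len(payload) > self.slot_size:
            raise ValueError("payload larger than slot")
        off = self._offset(node_id, slot)
        # Pad with zero bytes so stale data from earlier messages is cleared.
        self.pool[off:off + self.slot_size] = payload.ljust(self.slot_size, b"\0")
        return pack_imm(node_id, slot)

    def handle_completion(self, imm: int) -> Tuple[int, bytes]:
        """Locate the message directly from the immediate value."""
        node_id, slot = unpack_imm(imm)
        off = self._offset(node_id, slot)
        message = bytes(self.pool[off:off + self.slot_size]).rstrip(b"\0")
        return node_id, message


server = Server(nodes=4, slots=8, slot_size=32)
imm = server.remote_write(node_id=3, slot=5, payload=b"mkdir /a")
sender, request = server.handle_completion(imm)
```

In this toy version the receiver learns both who sent the request and where it landed from a single integer, which is the property the abstract's "immediate notification" points at; a real system would obtain that integer from a completion queue entry rather than a return value.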



Published In

ACM Transactions on Storage  Volume 17, Issue 3
August 2021
227 pages
ISSN:1553-3077
EISSN:1553-3093
DOI:10.1145/3477268
  • Editor: Sam H. Noh

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 August 2021
Accepted: 01 January 2021
Revised: 01 November 2020
Received: 01 July 2020
Published in TOS Volume 17, Issue 3


Author Tags

  1. Storage system
  2. distributed system
  3. remote direct memory access
  4. persistent memory

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • National Key Research & Development Program of China
  • National Natural Science Foundation of China
