Mojim: A Reliable and Highly-Available Non-Volatile Memory System

Published: 14 March 2015

Abstract

Next-generation non-volatile memories (NVMs) promise DRAM-like performance, persistence, and high density. They can attach directly to processors to form non-volatile main memory (NVMM) and offer the opportunity to build very low-latency storage systems. These high-performance storage systems would be especially useful in large-scale data center environments where reliability and availability are critical. However, providing reliability and availability to NVMM is challenging, since the latency of data replication can overwhelm the low latency that NVMM should provide. We propose Mojim, a system that provides the reliability and availability that large-scale storage systems require, while preserving the performance of NVMM. Mojim achieves these goals by using a two-tier architecture in which the primary tier contains a mirrored pair of nodes and the secondary tier contains one or more secondary backup nodes with weakly consistent copies of data. Mojim uses highly-optimized replication protocols, software, and networking stacks to minimize replication costs and expose as much of NVMM's performance as possible. We evaluate Mojim using raw DRAM as a proxy for NVMM and using an industrial NVMM emulation system. We find that Mojim provides replicated NVMM with similar or even better performance than un-replicated NVMM (reducing latency by 27% to 63% and delivering between 0.4 and 2.7X the throughput). We demonstrate that replacing MongoDB's built-in replication system with Mojim improves MongoDB's performance by 3.4 to 4X.
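
The two-tier layout described in the abstract can be pictured with a small toy model. The sketch below is not Mojim's protocol or API: the `Node` and `TwoTierReplicator` classes, the `write` and `drain` methods, and the choice to update the mirrored pair synchronously while queuing updates for the backups are all assumptions made for illustration. It only shows how a mirrored primary tier can stay in lockstep while a secondary tier holds a weakly consistent (possibly lagging) copy.

```python
from collections import deque


class Node:
    """Toy in-memory stand-in for one machine's NVMM region (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class TwoTierReplicator:
    """Illustrative only: synchronous mirror, lazily updated backups."""

    def __init__(self, primary, mirror, backups):
        self.primary = primary
        self.mirror = mirror
        self.backups = list(backups)
        # Updates accepted by the primary tier but not yet sent to backups.
        self.pending = deque()

    def write(self, key, value):
        # Primary tier (assumption): the write returns only after both
        # nodes of the mirrored pair hold the new value.
        self.primary.apply(key, value)
        self.mirror.apply(key, value)
        # Secondary tier (assumption): queue the update, so backups may lag.
        self.pending.append((key, value))

    def drain(self):
        # Push queued updates to every backup, preserving write order.
        while self.pending:
            key, value = self.pending.popleft()
            for backup in self.backups:
                backup.apply(key, value)


primary, mirror, backup = Node("primary"), Node("mirror"), Node("backup")
replicator = TwoTierReplicator(primary, mirror, [backup])
replicator.write("x", 1)
lagging = dict(backup.data)   # backup has not seen the write yet
replicator.drain()
caught_up = dict(backup.data)  # backup now matches the primary tier
```

In this toy model the backup only converges after `drain()` runs, which is one simple way to read "weakly consistent copies"; a real system would also have to handle failures, ordering across nodes, and persistence.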


Published In

ACM SIGPLAN Notices, Volume 50, Issue 4 (ASPLOS '15)
April 2015
676 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2775054
Editor: Andy Gill

ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems
March 2015
720 pages
ISBN: 9781450328357
DOI: 10.1145/2694344

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. availability
  2. data center
  3. distributed storage systems
  4. non-volatile memory
  5. reliability
  6. storage-class memory

Qualifiers

  • Research-article

Funding Sources

  • SRC
