Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article

Rearchitecting in-memory object stores for low latency

Published: 01 November 2021 Publication History

Abstract

Low latency is increasingly critical for modern workloads, to the extent that compute functions are explicitly scheduled to be co-located with their in-memory object stores for faster access. However, the traditional object store architecture mandates that clients interact with the server via inter-process communication (IPC). This poses a significant performance bottleneck for low-latency workloads. Meanwhile, in many important emerging AI workloads, such as parallel tree search and reinforcement learning, all the worker processes accessing the object store belong to a single user.
We design Lightning, an in-memory object store rearchitected for modern, low-latency workloads in a single-user, multi-process setting. Lightning departs from the traditional design by adopting a shared memory model, enabling clients to directly access the object store without IPC boundary. Instead, client isolation is achieved by a novel integration of Intel Memory Protect Keys (MPK) hardware, transaction logging, and formal verification. Our evaluations show that Lightning outperforms state-of-the-art in-memory object stores by up to 9.0x on five standard NoSQL workloads and up to 4.5x in scaling up a Python tree search program. Lightning improves the throughput of a popular reinforcement learning framework that uses an in-memory object store for data sharing by up to 40%.

References

[1]
Atul Adya, Robert Grandl, Daniel Myers, and Henry Qin. 2019. Fast Key-Value Stores: An Idea Whose Time Has Come and Gone. In HotOS.
[2]
Abutalib Aghayev, Sage Weil, Michael Kuchnik, Mark Nelson, Gregory R. Ganger, and George Amvrosiadis. 2019. File Systems Unfit as Distributed Storage Back-ends: Lessons from 10 Years of Ceph Evolution. In SOSP.
[3]
Sergei Arnautov, Bohdan Trach, Franz Gregor, Thomas Knauth, Andre Martin, Christian Priebe, Joshua Lind, Divya Muthukumaran, Dan O'Keeffe, Mark L. Stillwell, David Goltzsche, Dave Eyers, Rüdiger Kapitza, Peter Pietzuch, and Christof Fetzer. 2016. SCONE: Secure Linux Containers with Intel SGX. In OSDI.
[4]
arrow 2020. Apache Arrow: Powering In-Memory Analytics. https://github.com/apache/arrow.
[5]
Maurice Bailleu, Jörg Thalheim, Pramod Bhatotia, Christof Fetzer, Michio Honda, and Kapil Vaswani. 2019. Speicher: Securing LSM-Based Key-Value Stores Using Shielded Execution. In FAST.
[6]
Andrew Baumann, Marcus Peinado, and Galen Hunt. 2014. Shielding Applications from an Untrusted Cloud with Haven. In OSDI.
[7]
Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. 2013. The Arcade Learning Environment: an Evaluation Platform for General Agents. Journal of Artificial Intelligence Research 47 (2013), 253--279.
[8]
Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy. 1991. User-Level Interprocess Communication for Shared Memory Multiprocessors. ACM Trans. Comput. Syst. 9, 2 (May 1991), 175--198.
[9]
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. 2016. OpenAI Gym. arXiv preprint arXiv:1606.01540 (2016).
[10]
Tej Chajed, Haogang Chen, Adam Chlipala, M. Frans Kaashoek, Nickolai Zeldovich, and Daniel Ziegler. 2017. Certifying a File System Using Crash Hoare Logic: Correctness in the Presence of Crashes. Commun. ACM 60, 4 (March 2017), 75--84.
[11]
Tej Chajed, Frans Kaashoek, Butler Lampson, and Nickolai Zeldovich. 2018. Verifying Concurrent Software using Movers in {CSPEC }. In OSDI.
[12]
Chia che Tsai, Jeongseok Son, Bhushan Jain, John McAvey, Raluca Ada Popa, and Donald E. Porter. 2020. Civet: An Efficient Java Partitioning Framework for Hardware Enclaves. In USENIX Security.
[13]
Haogang Chen, Tej Chajed, Alex Konradi, Stephanie Wang, Atalay İleri, Adam Chlipala, M Frans Kaashoek, and Nickolai Zeldovich. 2017. Verifying a High-Performance Crash-Safe File System using a Tree Specification. In SOSP.
[14]
Jiqiang Chen, Liang Chen, Sheng Wang, Guoyun Zhu, Yuanyuan Sun, Huan Liu, and Feifei Li. 2020. HotRing: A Hotspot-Aware In-Memory Key-Value Store. In FAST.
[15]
coq 2020. The Coq Proof Assistant. https://coq.inria.fr/.
[16]
Rémi Coulom. 2006. Efficient Selectivity and Backup Operators in Monte-Carlo Tree search. In International conference on computers and games. Springer, 72--83.
[17]
Anthony Demeri, Wook-Hee Kim, R. Madhava Krishnan, Jaeho Kim, Mohannad Ismail, and Changwoo Min. 2020. Poseidon: Safe, Fast and Scalable Persistent Memory Allocator. In Middleware.
[18]
Diego Didona and Willy Zwaenepoel. 2019. Size-aware Sharding For Improving Tail Latencies in In-memory Key-value Stores. In NSDI.
[19]
dlmalloc 2020. A Memory Allocator. http://gee.cs.oswego.edu/dl/html/malloc.html.
[20]
Mingkai Dong, Heng Bu, Jifei Yi, Benchao Dong, and Haibo Chen. 2019. Performance and Protection in the ZoFS User-Space NVM File System.
[21]
Aleksandar Dragojević, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. 2014. FaRM: Fast Remote Memory. In NSDI.
[22]
Bin Fan, David G. Andersen, and Michael Kaminsky. 2013. MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing. In NSDI.
[23]
Andrew Ferraiuolo, Andrew Baumann, Chris Hawblitzel, and Bryan Parno. 2017. Komodo: Using Verification to Disentangle Secure-Enclave Hardware from Software. In SOSP.
[24]
Sylvain Gelly and David Silver. 2011. Monte-Carlo Tree Search and Rapid Action Value Estimation in Computer Go. Artificial Intelligence 175, 11 (2011), 1856--1875.
[25]
Jim Gray, Paul McJones, Mike Blasgen, Bruce Lindsay, Raymond Lorie, Tom Price, Franco Putzolu, and Irving Traiger. 1981. The Recovery Manager of the System R Database Manager. ACM Comput. Surv. 13, 2 (June 1981), 223--242.
[26]
Jinyu Gu, Xinyue Wu, Wentai Li, Nian Liu, Zeyu Mi, Yubin Xia, and Haibo Chen. 2020. Harmonizing Performance and Isolation in Microkernels with Efficient Intra-kernel Isolation and Communication. In USENIX ATC.
[27]
Ronghui Gu, Zhong Shao, Hao Chen, Xiongnan (Newman) Wu, Jieung Kim, Vilhelm Sjöberg, and David Costanzo. 2016. CertiKOS: An Extensible Architecture for Building Certified Concurrent OS Kernels. In OSDI.
[28]
Xiaoxiao Guo, Satinder Singh, Honglak Lee, Richard L Lewis, and Xiaoshi Wang. 2014. Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning. In NeurIPS.
[29]
Theo Haerder and Andreas Reuter. 1983. Principles of Transaction-Oriented Database Recovery. ACM Comput. Surv. 15, 4 (Dec. 1983), 287--317.
[30]
Chris Hawblitzel, Jon Howell, Manos Kapritsos, Jacob R. Lorch, Bryan Parno, Michael L. Roberts, Srinath Setty, and Brian Zill. 2015. IronFleet: Proving Practical Distributed Systems Correct. In SOSP.
[31]
Chris Hawblitzel, Jon Howell, Jacob R. Lorch, Arjun Narayan, Bryan Parno, Danfeng Zhang, and Brian Zill. 2014. Ironclad Apps: End-to-End Security via Automated Full-System Verification. In OSDI.
[32]
Mohammad Hedayati, Spyridoula Gravani, Ethan Johnson, John Criswell, Michael L. Scott, Kai Shen, and Mike Marty. 2019. Hodor: Intra-Process Isolation for High-Throughput Data Plane Libraries. In USENIX ATC.
[33]
hiredis 2020. Minimalistic C client for Redis >= 1.2. https://github.com/redis/hiredis.
[34]
Zhichao Hua, Jinyu Gu, Yubin Xia, Haibo Chen, Binyu Zang, and Haibing Guan. 2017. vTZ: Virtualizing ARM TrustZone. In USENIX Security.
[35]
Jinho Hwang, K. K. Ramakrishnan, and Timothy Wood. 2014. NetVM: High Performance and Flexible Networking Using Virtualization on Commodity Platforms. In NSDI.
[36]
jackson 2020. Jackson Project Home. https://github.com/FasterXML/jackson.
[37]
jemalloc 2020. jemalloc. http://jemalloc.net/.
[38]
Xin Jin, Xiaozhou Li, Haoyu Zhang, Robert Soulé, Jeongkeun Lee, Nate Foster, Changhoon Kim, and Ion Stoica. 2017. NetCache: Balancing Key-Value Stores with Fast In-Network Caching. In SOSP.
[39]
Daehyeok Kim, Tianlong Yu, Hongqiang Harry Liu, Yibo Zhu, Jitu Padhye, Shachar Raindel, Chuanxiong Guo, Vyas Sekar, and Srinivasan Seshan. 2019. FreeFlow: Software-based Virtual RDMA Networking for Containerized Clouds. In NSDI.
[40]
Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, Thomas Sewell, Harvey Tuch, and Simon Winwood. 2009. SeL4: Formal Verification of an OS Kernel. In SOSP.
[41]
Robert Krahn, Bohdan Trach, Anjo Vahldiek-Oberwagner, Thomas Knauth, Pramod Bhatotia, and Christof Fetzer. 2018. Pesos: Policy Enhanced Secure Object Store. In EuroSys.
[42]
Chris Lattner and Vikram Adve. 2004. LLVM: A compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization, 2004. CGO 2004. IEEE, 75--86.
[43]
K. Rustan M. Leino. 2010. Dafny: An Automatic Program Verifier for Functional Correctness. In Proceedings of the 16th International Conference on Logic for Programming, Artificial Intelligence, and Reasoning (Dakar, Senegal) (LPAR'10). Springer-Verlag, Berlin, Heidelberg, 348--370.
[44]
Bojie Li, Zhenyuan Ruan, Wencong Xiao, Yuanwei Lu, Yongqiang Xiong, Andrew Putnam, Enhong Chen, and Lintao Zhang. 2017. KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC. In SOSP.
[45]
Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph Gonzalez, Michael Jordan, and Ion Stoica. 2018. RLlib: Abstractions for Distributed Reinforcement Learning. In ICML.
[46]
Hyeontaek Lim, Dongsu Han, David G. Andersen, and Michael Kaminsky. 2014. MICA: A Holistic Approach to Fast In-Memory Key-Value Storage. In NSDI.
[47]
David E. Lowell and Peter M. Chen. 1997. Free Transactions with Rio Vista. In SOSP.
[48]
mctspy 2020. mctspy. https://pypi.org/project/mctspy/.
[49]
memcached 2020. Memcached. https://memcached.org/.
[50]
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous Methods for Deep Reinforcement Learning. In ICML.
[51]
Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. 2018. Ray: A Distributed Framework for Emerging AI Applications. In OSDI.
[52]
Luke Nelson, Jacob Van Geffen, Emina Torlak, and Xi Wang. 2020. Specification and verification in the field: Applying formal methods to BPF just-in-time compilers in the Linux kernel. In OSDI.
[53]
Luke Nelson, Helgi Sigurbjarnarson, Kaiyuan Zhang, Dylan Johnson, James Bornholt, Emina Torlak, and Xi Wang. 2017. Hyperkernel: Push-Button Verification of an OS Kernel. In SOSP.
[54]
Diego Ongaro, Stephen M. Rumble, Ryan Stutsman, John Ousterhout, and Mendel Rosenblum. 2011. Fast Crash Recovery in RAM Cloud. In SOSP.
[55]
Soyeon Park, Sangho Lee, Wen Xu, HyunGon Moon, and Taesoo Kim. 2019. libmpk: Software Abstraction for Intel Memory Protection Keys (Intel MPK). In USENIX ATC.
[56]
plasma 2020. The Plasma In-Memory Object Store. https://arrow.apache.org/docs/python/plasma.html.
[57]
plasmaclient 2020. Plasma Client. https://github.com/apache/arrow/blob/master/cpp/src/plasma/client.h.
[58]
postgreslog 2020. Chapter 29. Reliability and the Write-Ahead Log. https://www.postgresql.org/docs/9.1/wal-intro.html.
[59]
Vijayan Prabhakaran, Andrea C Arpaci-Dusseau, and Remzi H Arpaci-Dusseau. 2005. Analysis and Evolution of Journaling File Systems. In USENIX ATC.
[60]
redis 2020. Redis. https://redis.io/.
[61]
Mendel Rosenblum and John K. Ousterhout. 1992. The Design and Implementation of a Log-Structured File System. ACM Trans. Comput. Syst. 10, 1 (Feb. 1992), 26--52.
[62]
M. Satyanarayanan, Henry H. Mashburn, Puneet Kumar, David C. Steere, and James J. Kistler. 1993. Lightweight Recoverable Virtual Memory. In SOSP.
[63]
Helgi Sigurbjarnarson, James Bornholt, Emina Torlak, and Xi Wang. 2016. PushButton Verification of File Systems via Crash Refinement. In OSDI.
[64]
Helgi Sigurbjarnarson, Luke Nelson, Bruno Castro-Karney, James Bornholt, Emina Torlak, and Xi Wang. 2018. Nickel: A Framework for Design and Verification of Information Flow Control Systems. In OSDI.
[65]
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature 529, 7587 (2016), 484.
[66]
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. 2018. A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go Through Self-play. Science 362, 6419 (2018), 1140--1144.
[67]
Mincheol Sung, Pierre Olivier, Stefan Lankes, and Binoy Ravindran. 2020. Intra-Unikernel Isolation with Intel Memory Protection Keys. In VEE.
[68]
Yuandong Tian, Qucheng Gong, Wenling Shang, Yuxin Wu, and C Lawrence Zitnick. 2017. Elf: An Extensive, Lightweight and Flexible Research Platform for Real-Time Strategy Games. In NeurIPS.
[69]
Chia-Che Tsai, Donald E. Porter, and Mona Vij. 2017. Graphene-SGX: A Practical Library OS for Unmodified Applications on SGX. In USENIX ATC.
[70]
Anjo Vahldiek-Oberwagner, Eslam Elnikety, Nuno O. Duarte, Michael Sammler, Peter Druschel, and Deepak Garg. 2019. ERIM: Secure, Efficient In-process Isolation with Protection Keys (MPK). In USENIX Security.
[71]
Kefei Wang, Jian Liu, and Feng Chen. 2020. Put an Elephant into a Fridge: Optimizing Cache Efficiency for in-Memory Key-Value Stores. PVLDB 13, 9 (May 2020).
[72]
Stephanie Wang, John Liagouris, Robert Nishihara, Philipp Moritz, Ujval Misra, Alexey Tumanov, and Ion Stoica. 2019. Lineage Stash: Fault Tolerance off the Critical Path. In SOSP.
[73]
whitedb 2020. Whitedb. http://whitedb.org/.
[74]
ycsb 2020. Yahoo! Cloud Serving Benchmark. https://github.com/brianfrankcooper/YCSB/.
[75]
Arseniy Zaostrovnykh, Solal Pirelli, Rishabh Iyer, Matteo Rizzo, Luis Pedrosa, KaterinaArgyraki, and George Candea. 2019. Verifying Software Network Functions with No Verification Expertise. In SOSP.
[76]
Heng Zhang, Mingkai Dong, and Haibo Chen. 2016. Efficient and Available In-memory KV-Store with Hybrid Erasure Coding and Replication. In FAST.
[77]
Kai Zhang, Kaibo Wang, Yuan Yuan, Lei Guo, Rubao Lee, and Xiaodong Zhang. 2015. Mega-KV: A Case for GPUs to Maximize the Throughput of in-Memory Key-Value Stores. PVLDB 8, 11 (July 2015).
[78]
Kaiyuan Zhang, Danyang Zhuo, Aditya Akella, Arvind Krishnamurthy, and Xi Wang. 2020. Automated Verification of Customizable Middlebox Properties with Gravel. In NSDI.
[79]
Siyuan Zhuang, Zhuohan Li, Danyang Zhuo, Stephanie Wang, Eric Liang, Robert Nishihara, Philipp Moritz, and Ion Stoica. 2021. Hoplite: Efficient and Fault-Tolerant Collective Communication for Task-Based Distributed Systems. In SIG-COMM.
[80]
Mo Zou, Haoran Ding, Dong Du, Ming Fu, Ronghui Gu, and Haibo Chen. 2019. Using Concurrent Relational Logic with Helpers for Verifying the AtomFS File System. In SOSP.

Cited By

View all
  • (2024)Scalable and effective page-table and TLB management on NUMA systemsProceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference10.5555/3691992.3692020(445-461)Online publication date: 10-Jul-2024
  • (2023)Partial Failure Resilient Memory Management System for (CXL-based) Distributed Shared MemoryProceedings of the 29th Symposium on Operating Systems Principles10.1145/3600006.3613135(658-674)Online publication date: 23-Oct-2023

Recommendations

Comments

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 15, Issue 3
November 2021
364 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 November 2021
Published in PVLDB Volume 15, Issue 3

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)62
  • Downloads (Last 6 weeks)3
Reflects downloads up to 11 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Scalable and effective page-table and TLB management on NUMA systemsProceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference10.5555/3691992.3692020(445-461)Online publication date: 10-Jul-2024
  • (2023)Partial Failure Resilient Memory Management System for (CXL-based) Distributed Shared MemoryProceedings of the 29th Symposium on Operating Systems Principles10.1145/3600006.3613135(658-674)Online publication date: 23-Oct-2023

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media