DOI: 10.1145/3477132.3483544

Log-structured Protocols in Delos

Published: 26 October 2021

Abstract

Developers have access to a wide range of storage APIs and functionality in large-scale systems, such as relational databases, key-value stores, and namespaces. However, this diversity comes at a cost: each API is implemented by a complex distributed system that is difficult to develop and operate. Delos amortizes this cost by enabling different APIs on a shared codebase and operational platform. The primary innovation in Delos is a log-structured protocol: a fine-grained replicated state machine executing above a shared log that can be layered into reusable protocol stacks under different databases. We built and deployed two production databases using Delos at Facebook, creating nine different log-structured protocols in the process. We show via experiments and production data that log-structured protocols impose low overhead, while allowing optimizations that can improve latency by up to 100X (e.g., via leasing) and throughput by up to 2X (e.g., via batching).
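The layering idea in the abstract can be illustrated with a minimal sketch. This is not the Delos API: every name here (`SharedLog`, `BaseEngine`, `BatchingEngine`, `KVStore`, `build_replica`, `demo`) is a hypothetical stand-in, and the shared log is a plain in-memory list rather than a replicated service. The sketch only shows the general shape of the technique: proposals travel down a stack of engines to the log, and log entries are replayed back up the stack so that every replica applies the same sequence.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


class SharedLog:
    """In-memory stand-in for a replicated shared log (assumption, not Delos)."""

    def __init__(self) -> None:
        self.entries: List[Any] = []

    def append(self, entry: Any) -> int:
        self.entries.append(entry)
        return len(self.entries) - 1

    def read_from(self, pos: int) -> List[Any]:
        return self.entries[pos:]


class BaseEngine:
    """Bottom layer: appends proposals to the log and replays entries upward."""

    def __init__(self, log: SharedLog, upcall: Callable[[Any], None]) -> None:
        self.log = log
        self.upcall = upcall
        self.next_pos = 0  # first log position this replica has not applied

    def propose(self, entry: Any) -> None:
        self.log.append(entry)

    def sync(self) -> None:
        for entry in self.log.read_from(self.next_pos):
            self.upcall(entry)
            self.next_pos += 1


class BatchingEngine:
    """Middle layer: packs several proposals into one log entry, unpacks on apply."""

    def __init__(self, batch_size: int, upcall: Callable[[Any], None]) -> None:
        self.batch_size = batch_size
        self.upcall = upcall
        self.below: Optional[BaseEngine] = None
        self.pending: List[Any] = []

    def propose(self, entry: Any) -> None:
        self.pending.append(entry)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.pending:
            self.below.propose(("batch", list(self.pending)))
            self.pending.clear()

    def apply(self, entry: Tuple[str, List[Any]]) -> None:
        _kind, items = entry
        for item in items:
            self.upcall(item)

    def sync(self) -> None:
        self.below.sync()


class KVStore:
    """Top of the stack: a deterministic key-value state machine."""

    def __init__(self) -> None:
        self.data: Dict[str, Any] = {}

    def apply(self, entry: Tuple[str, str, Any]) -> None:
        op, key, value = entry
        if op == "put":
            self.data[key] = value
        elif op == "delete":
            self.data.pop(key, None)


def build_replica(log: SharedLog, batch_size: int = 2):
    store = KVStore()
    batching = BatchingEngine(batch_size, upcall=store.apply)
    batching.below = BaseEngine(log, upcall=batching.apply)
    return store, batching


def demo():
    log = SharedLog()
    store_a, stack_a = build_replica(log)
    store_b, stack_b = build_replica(log)
    stack_a.propose(("put", "x", 1))
    stack_a.propose(("put", "y", 2))        # batch is full: one log entry
    stack_b.propose(("delete", "x", None))
    stack_b.flush()                          # force out a partial batch
    stack_a.sync()
    stack_b.sync()
    return len(log.entries), store_a.data, store_b.data
```

In this sketch the three operations occupy only two log entries, and both replicas reach the same state because each one replays the same log in order. The key-value layer never sees the batching, which is the property that would let such a layer be reused under a different top-level API.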

Published In

SOSP '21: Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles
October 2021
899 pages
ISBN:9781450387095
DOI:10.1145/3477132
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Consensus
  2. State Machine Replication

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SOSP '21

Acceptance Rates

Overall Acceptance Rate 174 of 961 submissions, 18%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 76
  • Downloads (Last 6 weeks): 5
Reflects downloads up to 17 Feb 2025

Cited By

  • (2024) The Key Ideas Behind Boki's Shared Logs. ACM SIGOPS Operating Systems Review 58:1 (7-14). DOI: 10.1145/3689051.3689054. Online publication date: 14-Aug-2024
  • (2024) IndiLog: Bridging Scalability and Performance in Stateful Serverless Computing with Shared Logs. Proceedings of the 17th ACM International Systems and Storage Conference (1-13). DOI: 10.1145/3688351.3689159. Online publication date: 16-Sep-2024
  • (2024) Boki: Towards Data Consistency and Fault Tolerance with Shared Logs in Stateful Serverless Computing. ACM Transactions on Computer Systems 42:3-4 (1-35). DOI: 10.1145/3653072. Online publication date: 12-Sep-2024
  • (2024) Optimizing Distributed Protocols with Query Rewrites. Proceedings of the ACM on Management of Data 2:1 (1-25). DOI: 10.1145/3639257. Online publication date: 26-Mar-2024
  • (2023) Fine-Grained Re-Execution for Efficient Batched Commit of Distributed Transactions. Proceedings of the VLDB Endowment 16:8 (1930-1943). DOI: 10.14778/3594512.3594523. Online publication date: 1-Apr-2023
  • (2023) Halfmoon: Log-Optimal Fault-Tolerant Stateful Serverless Computing. Proceedings of the 29th Symposium on Operating Systems Principles (314-330). DOI: 10.1145/3600006.3613154. Online publication date: 23-Oct-2023
  • (2023) DARQ Matter Binds Everything: Performant and Composable Cloud Programming via Resilient Steps. Proceedings of the ACM on Management of Data 1:2 (1-27). DOI: 10.1145/3589262. Online publication date: 20-Jun-2023
  • (2023) FlexLog: A Shared Log for Stateful Serverless Computing. Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing (195-209). DOI: 10.1145/3588195.3592993. Online publication date: 7-Aug-2023
