
Lotus: scalable multi-partition transactions on single-threaded partitioned databases

Published: 01 July 2022


This paper revisits the H-Store/VoltDB concurrency control scheme for partitioned main-memory databases, which we term run-to-completion-single-thread (RCST), with an eye toward improving its poor performance on multi-partition (MP) workloads. The original scheme focused on maximizing single-partition (SP) performance, achieving throughput in the millions of transactions per second on modest clusters, but at the expense of dismal MP performance. In this paper, we show that the original RCST algorithms can be modified to dramatically improve MP performance with very limited impact on SP performance. That makes RCST superior to popular optimistic and pessimistic schemes that lack optimizations for batch execution, including OCC and 2PL, on a wide range of multi-node workloads, with up to 60% higher throughput.
Our second contribution is a multiplexed-execution-single-thread (MEST) algorithm, based on RCST, that amortizes the network stalls of MP transactions over a batch of MP transactions. This scheme delivers up to 21X higher SP throughput than state-of-the-art distributed deterministic concurrency control algorithms that are optimized for batch execution, while achieving comparable MP throughput. Finally, our MEST scheme offers dramatically superior performance when straggler transactions are present in the workload. Our conclusion is that the H-Store/VoltDB concurrency control scheme can be dramatically improved and, once improved, dominates state-of-the-art algorithms over a variety of MP workloads.
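To make the amortization idea concrete, here is a toy cost model, not the paper's algorithm or measurements. It assumes (hypothetically) that every MP transaction needs a fixed amount of local CPU work plus exactly one network round trip, and that a multiplexing thread can overlap all round trips within a batch perfectly. The function names and all numbers are illustrative only.

```python
def run_to_completion_time(n_mp, exec_us, rtt_us):
    """Total time when one thread finishes each MP transaction before
    starting the next, idling during every network round trip."""
    return n_mp * (exec_us + rtt_us)


def multiplexed_time(n_mp, exec_us, rtt_us, batch_size):
    """Total time when the thread interleaves a batch of MP transactions,
    so each batch pays for only one round-trip stall (idealized overlap)."""
    n_batches = -(-n_mp // batch_size)  # ceiling division
    return n_mp * exec_us + n_batches * rtt_us


if __name__ == "__main__":
    # Hypothetical workload: 100 MP transactions, 10 us of CPU work each,
    # 100 us network round trip, batches of 10.
    serial = run_to_completion_time(100, 10, 100)
    batched = multiplexed_time(100, 10, 100, 10)
    print(f"run-to-completion: {serial} us, multiplexed: {batched} us")
```

In this simplified model the stall cost per transaction shrinks from one round trip to one round trip divided by the batch size, which is the sense in which a batch amortizes network stalls. Real systems will not overlap stalls perfectly, so the model is an upper bound on the benefit under its own assumptions.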



Cited By

  • (2024) Spectrum: Speedy and Strictly-Deterministic Smart Contract Transactions for Blockchain Ledgers. Proceedings of the VLDB Endowment 17:10 (2541-2554). DOI: 10.14778/3675034.3675045. Online publication date: 1-Jun-2024
  • (2024) Occam's Razor for Distributed Protocols. Proceedings of the 2024 ACM Symposium on Cloud Computing (618-636). DOI: 10.1145/3698038.3698514. Online publication date: 20-Nov-2024
  • (2023) R3: Record-Replay-Retroaction for Database-Backed Applications. Proceedings of the VLDB Endowment 16:11 (3085-3097). DOI: 10.14778/3611479.3611510. Online publication date: 24-Aug-2023


Published in: Proceedings of the VLDB Endowment, Volume 15, Issue 11 (July 2022), 980 pages
Publisher: VLDB Endowment
Article type: Research article

