research-article

In-network leaderless replication for distributed data stores

Authors:

Wonjun Lee

Proceedings of the VLDB Endowment, Volume 15, Issue 7

Pages 1337 - 1349

https://doi.org/10.14778/3523210.3523213

Published: 01 March 2022

Abstract

Leaderless replication allows any replica to handle any type of request, giving distributed data stores read scalability and high availability. However, it entails burdensome coordination overhead in the replication protocol, which degrades write throughput. In addition, the data store still requires coordination for membership changes, making it hard to resolve server failures quickly. To address these problems, we present NetLR, a replicated data store architecture that supports high performance, fault tolerance, and linearizability simultaneously. The key idea of NetLR is to move all replication functions into the network by using the switch as an on-path in-network replication orchestrator. Specifically, NetLR performs consistency-aware read scheduling, high-performance write coordination, and active fault adaptation in the network switch. Our in-network replication eliminates inter-replica coordination for writes and membership changes, providing high write performance and fast failure handling. NetLR can be implemented on programmable switches at line rate with only 5.68% additional memory usage. We implement a prototype of NetLR on an Intel Tofino switch and conduct extensive testbed experiments. Our evaluation results show that NetLR is the only solution that achieves high throughput and low latency while remaining robust to server failures.

Cited By

Kim G, Schulzrinne H, Kohler E, Maltz D, Misra V (2023). NetClone: Fast, Scalable, and Dynamic Request Cloning for Microsecond-Scale RPCs. Proceedings of the ACM SIGCOMM 2023 Conference, 195-207. DOI: 10.1145/3603269.3604820. Online publication date: 10-Sep-2023.
https://dl.acm.org/doi/10.1145/3603269.3604820

Published In

Proceedings of the VLDB Endowment Volume 15, Issue 7

March 2022

208 pages

ISSN: 2150-8097

Editors: Fatma Özcan (Google), Juliana Freire (New York University), Xuemin Lin (University of New South Wales)

Publisher

VLDB Endowment

Article Metrics

Total Citations: 1
Total Downloads: 158
Downloads (Last 12 months): 55
Downloads (Last 6 weeks): 8

Reflects downloads up to 10 Nov 2024