In-network leaderless replication for distributed data stores

Published: 01 March 2022 Publication History

Abstract

Leaderless replication lets any replica handle any type of request, giving distributed data stores read scalability and high availability. However, it incurs heavy coordination overhead in the replication protocol, which degrades write throughput. In addition, the data store still requires coordination for membership changes, making it hard to resolve server failures quickly. To address these problems, we present NetLR, a replicated data store architecture that supports high performance, fault tolerance, and linearizability simultaneously. The key idea of NetLR is to move all replication functions into the network by using the switch as an on-path in-network replication orchestrator. Specifically, NetLR performs consistency-aware read scheduling, high-performance write coordination, and active fault adaptation in the network switch. Our in-network replication eliminates inter-replica coordination for writes and membership changes, providing high write performance and fast failure handling. NetLR can be implemented on programmable switches at line rate with only 5.68% additional memory usage. We implement a prototype of NetLR on an Intel Tofino switch and conduct extensive testbed experiments. Our evaluation shows that NetLR is the only solution that achieves high throughput and low latency while remaining robust to server failures.
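
The three switch roles named in the abstract (consistency-aware read scheduling, write coordination, and fault adaptation) can be illustrated with a toy, single-process model. This is only a sketch under loud assumptions: the abstract does not describe NetLR's actual data-plane logic, so the names `Replica`, `ToySwitch`, `begin_write`, `deliver`, and `fail`, the per-key pending table, and the one-outstanding-write-per-key restriction are all hypothetical and chosen purely for illustration.

```python
from itertools import count


class Replica:
    """A storage server holding key -> (sequence number, value)."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def apply(self, key, seq, value):
        # Keep only the newest write, ordered by the switch-assigned sequence number.
        if seq > self.store.get(key, (0, None))[0]:
            self.store[key] = (seq, value)

    def get(self, key):
        return self.store.get(key, (0, None))[1]


class ToySwitch:
    """Hypothetical on-path orchestrator; NOT NetLR's actual design."""

    def __init__(self, replicas):
        self.replicas = {r.name: r for r in replicas}
        self.live = [r.name for r in replicas]   # current membership
        self.seq = count(1)                      # global write order
        self.pending = {}                        # key -> (seq, value, acked replicas)
        self.rr = 0                              # round-robin cursor for reads

    def begin_write(self, key, value):
        # Write coordination: the switch orders the write; replicas do not
        # talk to each other. Assumes one outstanding write per key.
        seq = next(self.seq)
        self.pending[key] = (seq, value, set())
        return seq

    def deliver(self, key, name):
        # Replica `name` applies the pending write and acknowledges it.
        seq, value, acked = self.pending[key]
        self.replicas[name].apply(key, seq, value)
        acked.add(name)
        self._maybe_commit(key)

    def _maybe_commit(self, key):
        # A write is complete once every live replica has acknowledged it.
        if self.pending[key][2] >= set(self.live):
            del self.pending[key]

    def read(self, key):
        # Read scheduling: while a write is in flight, prefer replicas that
        # already applied it; otherwise any live replica may serve the read.
        candidates = self.live
        if key in self.pending:
            acked = self.pending[key][2]
            up_to_date = [n for n in self.live if n in acked]
            if up_to_date:
                candidates = up_to_date
        name = candidates[self.rr % len(candidates)]
        self.rr += 1
        return name, self.replicas[name].get(key)

    def fail(self, name):
        # Fault adaptation: drop the replica from membership at the switch
        # and stop waiting for its acknowledgements.
        self.live.remove(name)
        for key in list(self.pending):
            self._maybe_commit(key)
```

In this toy model, a read for a key with an in-flight write is steered to a replica that has already applied it, and removing a failed replica is a local update at the switch rather than a round of inter-replica messages. A real programmable-switch implementation would face constraints (register memory, packet loss, reordering) that this sketch ignores.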

Cited By

  • (2023) NetClone: Fast, Scalable, and Dynamic Request Cloning for Microsecond-Scale RPCs. Proceedings of the ACM SIGCOMM 2023 Conference, 195-207. DOI: 10.1145/3603269.3604820. Online publication date: 10-Sep-2023

Published In

Proceedings of the VLDB Endowment  Volume 15, Issue 7
March 2022
208 pages
ISSN:2150-8097

Publisher

VLDB Endowment

Qualifiers

  • Research-article
