Partial Network Partitioning

Published: 18 December 2023

Abstract

We present an extensive study of partial network partitioning, a fault that disrupts communication between some, but not all, nodes in a cluster. First, we conduct a comprehensive study of system failures caused by this fault in 13 popular systems. Our study reveals that the studied failures are catastrophic (e.g., they lead to data loss), manifest easily, and are mainly due to design flaws. Our analysis identifies vulnerabilities in core system mechanisms, including scheduling, membership management, and ZooKeeper-based configuration management.
Second, we dissect the design of nine popular systems and identify four principled approaches for tolerating partial partitions. Unfortunately, our analysis shows that the implemented fault tolerance techniques are inadequate for modern systems: they either patch a particular mechanism or lead to a complete cluster shutdown, even when alternative network paths exist.
Finally, our findings motivate us to build Nifty, a transparent communication layer that masks partial network partitions. Nifty builds an overlay between the nodes to detour packets around partial partitions. Nifty also provides an approach for applications to optimize their operation during a partial partition. We demonstrate the benefit of this approach by integrating Nifty with VoltDB, HDFS, and Kafka.
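
To make the detour idea concrete, the sketch below illustrates the general technique of routing around a partial partition: if the direct link between two nodes is down but both still reach a common peer, traffic can be relayed through that peer. This is only a minimal illustration under assumed names (shortest_detour, links); it is not Nifty's implementation, which operates as a transparent communication layer rather than an application-level search.

```python
from collections import deque

def shortest_detour(reachable, src, dst):
    """Breadth-first search over the node connectivity graph.
    Returns a node path from src to dst, or None if no path exists
    (i.e., the partition is complete rather than partial)."""
    if dst in reachable[src]:
        return [src, dst]                 # direct link is up: no detour needed
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in reachable[node]:
            if nxt not in parent:
                parent[nxt] = node
                if nxt == dst:            # reconstruct the detour path
                    path = [dst]
                    while parent[path[-1]] is not None:
                        path.append(parent[path[-1]])
                    return path[::-1]
                queue.append(nxt)
    return None                           # complete partition: no alternative path

# Hypothetical partial partition: nodes A and B cannot talk directly,
# but node C can reach both, so it can serve as a relay.
links = {
    "A": {"C"},
    "B": {"C"},
    "C": {"A", "B"},
}
print(shortest_detour(links, "A", "B"))   # -> ['A', 'C', 'B']
```

In this example the overlay would forward A-to-B packets through C for as long as the direct A-B link remains disconnected, which is the essence of masking a partial partition when alternative network paths exist.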


Cited By

  • (2023) CASPR: Connectivity-Aware Scheduling for Partition Resilience. In Proceedings of the 42nd International Symposium on Reliable Distributed Systems (SRDS'23). 70–81. DOI: 10.1109/SRDS60354.2023.00017. Online publication date: 25 September 2023.


Published In

ACM Transactions on Computer Systems, Volume 41, Issue 1-4
November 2023
188 pages
ISSN: 0734-2071
EISSN: 1557-7333
DOI: 10.1145/3637801

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 December 2023
Online AM: 19 December 2022
Accepted: 17 November 2022
Revised: 22 March 2022
Received: 14 September 2021
Published in TOCS Volume 41, Issue 1-4


Author Tags

  1. Network failures
  2. fault tolerance
  3. partial network partitions
  4. distributed systems
  5. reliability

Qualifiers

  • Research-article

Funding Sources

  • NSERC Discovery grant
  • Canada Foundation for Innovation (CFI) grant
  • NSERC Collaborative Research and Development (CRD) grant
  • Waterloo-Huawei Joint Innovation Lab grant
  • IBM Ph.D. fellowship
