research-article

Open access

Fail-slow fault tolerance needs programming support

Authors:

Tianyin Xu

HotOS '21: Proceedings of the Workshop on Hot Topics in Operating Systems

Pages 228 - 235

https://doi.org/10.1145/3458336.3465299

Published: 03 June 2021

Abstract

Fail-slow hardware and software components are reported increasingly often, and they degrade performance system-wide, which highlights the need for fail-slow fault tolerance in modern distributed systems. We argue that fail-slow fault tolerance needs not only new distributed protocol designs but also programming support for implementing and verifying fail-slow fault-tolerant code. Our observation is that the inability of existing distributed systems to tolerate fail-slow faults is often rooted in their implementations and is difficult to understand and debug. We designed the Dependably Fast Library (DepFast) for implementing fail-slow tolerant distributed systems. DepFast provides expressive interfaces for taking control of possible fail-slow points in the program to prevent unexpected slowness propagation once and for all. We use DepFast to implement a distributed replicated state machine (RSM) and show that it can tolerate various types of fail-slow faults that affect existing RSM implementations.
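To make the idea of a "fail-slow point" concrete, the sketch below shows one common way slowness propagates in a replicated system and one way to contain it: the caller waits for a quorum of acknowledgements under a deadline instead of waiting for every replica. This is an illustration only and is not DepFast's interface; `replicate`, `wait_for_quorum`, the replica names, and the delays are all hypothetical.

```python
import asyncio


async def replicate(replica_id: str, delay: float) -> str:
    """Hypothetical replica RPC: acknowledges after `delay` seconds."""
    await asyncio.sleep(delay)
    return replica_id


async def wait_for_quorum(calls, quorum: int, timeout: float):
    """Return the IDs of the first `quorum` replicas that acknowledge.

    Raises asyncio.TimeoutError if the quorum is not reached within
    `timeout` seconds. Calls still pending on exit are cancelled, so a
    slow replica cannot keep the caller waiting.
    """
    tasks = [asyncio.ensure_future(c) for c in calls]
    acked = []
    try:
        # as_completed yields results in completion order and enforces
        # the overall deadline.
        for next_done in asyncio.as_completed(tasks, timeout=timeout):
            acked.append(await next_done)
            if len(acked) >= quorum:
                return acked
        raise asyncio.TimeoutError("quorum not reached")
    finally:
        for t in tasks:
            t.cancel()  # no-op for tasks that already finished


def demo():
    # r3 is the fail-slow replica in this made-up scenario.
    delays = {"r1": 0.01, "r2": 0.02, "r3": 5.0}
    calls = [replicate(r, d) for r, d in delays.items()]
    return asyncio.run(wait_for_quorum(calls, quorum=2, timeout=1.0))


if __name__ == "__main__":
    print(demo())
```

In this sketch the request completes as soon as two of the three replicas respond, so the five-second replica does not add to the caller's latency. Requiring all three acknowledgements under the same deadline would raise a timeout instead.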

Published In

HotOS '21: Proceedings of the Workshop on Hot Topics in Operating Systems

June 2021

251 pages

ISBN:9781450384384

DOI:10.1145/3458336

General Chair: Sebastian Angel (University of Pennsylvania)

Program Chairs: Baris Kasikci (University of Michigan), Eddie Kohler (Harvard University)

Copyright © 2021 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

NSF

Conference

HotOS '21

Sponsor:

SIGOPS

HotOS '21: Workshop on Hot Topics in Operating Systems

June 1 - 3, 2021

Ann Arbor, Michigan
