DOI: 10.1145/3458336.3465299
research-article
Open access

Fail-slow fault tolerance needs programming support

Published: 03 June 2021

Abstract

The need for fail-slow fault tolerance in modern distributed systems is highlighted by the growing number of reports of fail-slow hardware and software components that degrade performance system-wide. We argue that fail-slow fault tolerance needs not only new distributed protocol designs but also programming support for implementing and verifying fail-slow fault-tolerant code. Our observation is that the inability of existing distributed systems to tolerate fail-slow faults is often rooted in their implementations and is difficult to understand and debug. We designed the Dependably Fast Library (DepFast) for implementing fail-slow tolerant distributed systems. DepFast provides expressive interfaces for taking control of possible fail-slow points in a program to prevent unexpected slowness propagation once and for all. We use DepFast to implement a distributed replicated state machine (RSM) and show that it can tolerate various types of fail-slow faults that affect existing RSM implementations.
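
As a rough illustration of what controlling a fail-slow point can look like in RSM code, the sketch below waits for a quorum of acknowledgements rather than for every replica, so a single slow replica does not stall a commit. It is not DepFast's interface: the names `append_on_replica`, `wait_for_quorum`, and `replicate_entry`, the simulated delays, and the choice of Python's `asyncio` are all assumptions made for this example.

```python
import asyncio


async def append_on_replica(replica_id: int, delay_s: float) -> int:
    """Hypothetical stand-in for a replication RPC; the delay simulates a slow replica."""
    await asyncio.sleep(delay_s)
    return replica_id


async def wait_for_quorum(tasks, quorum: int, timeout_s: float):
    """Return the first `quorum` results; raise asyncio.TimeoutError if too few arrive in time."""
    acks = []
    try:
        # as_completed yields awaitables in completion order, so a slow
        # replica is simply never reached once the quorum is met.
        for next_done in asyncio.as_completed(tasks, timeout=timeout_s):
            acks.append(await next_done)
            if len(acks) >= quorum:
                break
    finally:
        # Simplification for a self-contained demo: stragglers are cancelled.
        # A real system would more likely let them finish in the background.
        for task in tasks:
            task.cancel()
    return acks


async def replicate_entry(delays_s, quorum: int, timeout_s: float):
    """Start one simulated replication per replica and wait only for a quorum."""
    tasks = [
        asyncio.create_task(append_on_replica(i, d))
        for i, d in enumerate(delays_s)
    ]
    return await wait_for_quorum(tasks, quorum, timeout_s)


if __name__ == "__main__":
    # Replica 2 is fail-slow (5 s); the call should return well before that.
    print(asyncio.run(replicate_entry([0.01, 0.05, 5.0], quorum=2, timeout_s=1.0)))
```

The explicit timeout matters as much as the quorum: if too many replicas are slow at once, the caller gets an error it can act on instead of blocking indefinitely.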


Cited By

  • (2023) Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems. Proceedings of the Eighteenth European Conference on Computer Systems, pages 433-451. DOI: 10.1145/3552326.3587448. Online publication date: 8 May 2023.

    Published In

    HotOS '21: Proceedings of the Workshop on Hot Topics in Operating Systems
    June 2021
    251 pages
    ISBN:9781450384384
    DOI:10.1145/3458336
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. consensus
    2. distributed systems
    3. fail slow
    4. fault tolerance

    Qualifiers

    • Research-article

    Funding Sources

    • NSF

