DOI: 10.5555/2930611.2930631

Minimizing faulty executions of distributed systems

Published: 16 March 2016

Abstract

When troubleshooting buggy executions of distributed systems, developers typically start by manually separating out events that are responsible for triggering the bug (signal) from those that are extraneous (noise). We present DEMi, a tool for automatically performing this minimization. We apply DEMi to buggy executions of two very different distributed systems, Raft and Spark, and find that it produces minimized executions that are between 1X and 4.6X the size of optimal executions.
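
To make the idea of separating signal from noise concrete, here is a minimal sketch of generic delta-debugging-style minimization over a recorded event list. It is not DEMi's algorithm: it assumes the execution can be treated as a flat list of events and that a hypothetical `triggers_bug` oracle can deterministically replay any subsequence and report whether the bug still appears. Real distributed executions violate both assumptions to some degree, which is part of what makes the problem hard.

```python
from typing import Callable, List, Sequence, TypeVar

E = TypeVar("E")


def minimize(events: Sequence[E],
             triggers_bug: Callable[[List[E]], bool]) -> List[E]:
    """Shrink `events` to a subsequence that still triggers the bug.

    `triggers_bug` is a hypothetical oracle (an assumption of this sketch):
    it replays the given events and returns True if the bug reappears.
    The result is 1-minimal: removing any single remaining event makes
    the oracle return False.
    """
    current = list(events)
    if not triggers_bug(current):
        raise ValueError("the full execution must reproduce the bug")

    n = 2  # number of chunks the current event list is split into
    while len(current) >= 2:
        size = -(-len(current) // n)  # ceiling division
        chunks = [current[i:i + size] for i in range(0, len(current), size)]

        reduced = False
        for i in range(len(chunks)):
            # Drop chunk i and keep everything else, preserving order.
            complement = [e for j, c in enumerate(chunks) if j != i for e in c]
            if triggers_bug(complement):
                current = complement
                n = max(n - 1, 2)  # coarsen again after a successful cut
                reduced = True
                break

        if not reduced:
            if n >= len(current):
                break  # already tried removing every single event
            n = min(len(current), n * 2)  # retry with finer-grained chunks

    return current
```

For example, with ten events numbered 0 through 9 and a bug that needs both event 3 and event 7, `minimize(list(range(10)), lambda evs: 3 in evs and 7 in evs)` returns `[3, 7]`.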

Published In

NSDI'16: Proceedings of the 13th USENIX Conference on Networked Systems Design and Implementation
March 2016
699 pages
ISBN: 9781931971294

Sponsors

  • VMware
  • Google Inc.
  • Microsoft Research
  • Facebook

Publisher

USENIX Association

United States

