DOI: 10.5555/1973430.1973449
Article

WiDS checker: combating bugs in distributed systems

Published: 11 April 2007

Abstract

Despite many efforts, the predominant practice for debugging a distributed system is still printf-based log mining, which is both tedious and error-prone. In this paper, we present WiDS Checker, a unified framework that can check distributed systems through both simulation and reproduced runs from real deployment. All instances of a distributed system can be executed within one simulation process, multiplexed properly to observe the "happens-before" relationship, thus accurately revealing the full system state. A versatile script language allows a developer to refine system properties into straightforward assertions, which the checker inspects for violations. Combining these two components, we are able to check distributed properties that are otherwise impossible to check. We applied WiDS Checker to a suite of complex, real systems and found non-trivial bugs, including one in a previously proven Paxos specification. Our experience demonstrates the usefulness of the checker and yields insights beneficial to future research in this area.
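
The idea of replaying all nodes inside one process and asserting a property over the full system state can be illustrated with a minimal Python sketch. It is not the WiDS Checker script language or API, which the abstract does not show. The `Event` record, the `replay_and_check` function, the `holds_lock` variable, and the use of Lamport timestamps to obtain an order consistent with happens-before are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional

@dataclass(frozen=True)
class Event:
    """One logged state update (hypothetical record layout)."""
    clock: int      # Lamport timestamp assigned when the event occurred
    node: str       # node on which the event occurred
    key: str        # state variable updated by the event
    value: object   # new value of that variable

# Global state: node name -> {variable name -> value}
State = Dict[str, Dict[str, object]]

def replay_and_check(events: Iterable[Event],
                     predicate: Callable[[State], bool]) -> Optional[Event]:
    """Replay events in one process and check the predicate after each step.

    Returns the first event after which the predicate is false, or None.
    """
    state: State = {}
    # A causally later event always carries a larger Lamport clock, so
    # sorting by (clock, node) yields a total order that respects
    # happens-before; the node name only breaks ties deterministically.
    for ev in sorted(events, key=lambda e: (e.clock, e.node)):
        state.setdefault(ev.node, {})[ev.key] = ev.value
        if not predicate(state):
            return ev
    return None

def at_most_one_holder(state: State) -> bool:
    """Assertion over the whole system: no two nodes hold the lock."""
    holders = [n for n, variables in state.items()
               if variables.get("holds_lock")]
    return len(holders) <= 1

# A made-up log in which node C acquires the lock while B still holds it.
log = [
    Event(4, "C", "holds_lock", True),
    Event(1, "A", "holds_lock", True),
    Event(3, "B", "holds_lock", True),
    Event(2, "A", "holds_lock", False),
    Event(5, "B", "holds_lock", False),
]

violation = replay_and_check(log, at_most_one_holder)
print(violation)
```

The property here spans several nodes, so no single node's log could check it alone. Seeing the combined state in one consistent order is what makes the assertion checkable.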


Published In

NSDI'07: Proceedings of the 4th USENIX conference on Networked systems design & implementation
April 2007
27 pages

Sponsors

  • VMware
  • Google Inc.
  • Microsoft Research
  • Intel
  • CISCO

Publisher

USENIX Association

United States

Qualifiers

  • Article

Citations

Cited By

  • (2019) FlyMC. Proceedings of the Fourteenth EuroSys Conference 2019, pp. 1-16. DOI: 10.1145/3302424.3303986. Online publication date: 25-Mar-2019
  • (2019) Teaching Rigorous Distributed Systems With Efficient Model Checking. Proceedings of the Fourteenth EuroSys Conference 2019, pp. 1-15. DOI: 10.1145/3302424.3303947. Online publication date: 25-Mar-2019
  • (2018) Inferring and asserting distributed system invariants. Proceedings of the 40th International Conference on Software Engineering, pp. 1149-1159. DOI: 10.1145/3180155.3180199. Online publication date: 27-May-2018
  • (2018) A Survey of Recent Trends in Testing Concurrent Software Systems. IEEE Transactions on Software Engineering 44:8, pp. 747-783. DOI: 10.1109/TSE.2017.2707089. Online publication date: 1-Aug-2018
  • (2017) DCatch. ACM SIGARCH Computer Architecture News 45:1, pp. 677-691. DOI: 10.1145/3093337.3037735. Online publication date: 4-Apr-2017
  • (2017) DCatch. ACM SIGPLAN Notices 52:4, pp. 677-691. DOI: 10.1145/3093336.3037735. Online publication date: 4-Apr-2017
  • (2017) DCatch. Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 677-691. DOI: 10.1145/3037697.3037735. Online publication date: 4-Apr-2017
  • (2016) TaxDC. ACM SIGARCH Computer Architecture News 44:2, pp. 517-530. DOI: 10.1145/2980024.2872374. Online publication date: 25-Mar-2016
  • (2016) TaxDC. ACM SIGPLAN Notices 51:4, pp. 517-530. DOI: 10.1145/2954679.2872374. Online publication date: 25-Mar-2016
  • (2016) TaxDC. Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 517-530. DOI: 10.1145/2872362.2872374. Online publication date: 25-Mar-2016
