DOI: 10.5555/1929004.1929007

dBug: systematic evaluation of distributed systems

Published: 06 October 2010

Abstract

This paper presents the design, implementation, and evaluation of "dBug", a tool that leverages manual instrumentation for systematic evaluation of distributed and concurrent systems. Specifically, given a distributed concurrent system, its initial state, and a workload, the dBug tool systematically explores the possible orders in which concurrent events triggered by the workload can happen. Further, dBug optionally uses partial order reduction to avoid exploring equivalent orders. Provided with a correctness check, the dBug tool is able to verify that all possible serializations of a given concurrent workload execute correctly. Upon encountering an error, the tool produces a trace that can be replayed to investigate the error.
We applied the dBug tool to two distributed systems: the Parallel Virtual File System (PVFS), implemented in C, and the FAWN-based key-value storage (FAWN-KV), implemented in C++. In particular, we integrated both systems with dBug to expose the non-determinism due to concurrency. This mechanism was used to verify that the result of concurrent execution of a number of basic operations from a fixed initial state meets the high-level specification of PVFS and FAWN-KV. The experimental evidence shows that the dBug tool is capable of systematically exploring behaviors of a distributed system in a modular, practical, and effective manner.
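
The exploration loop described in the abstract can be illustrated with a minimal Python sketch. This is not dBug's implementation or API: the workload (two clients performing a non-atomic increment), the names `interleavings`, `run`, `check`, `independent`, `canonical`, and `explore`, and the pruning rule are all hypothetical and chosen only for illustration. The `canonical` filter is a deliberately simplified stand-in for partial order reduction: it skips a schedule when two adjacent independent events appear in descending order, since swapping them yields an equivalent schedule that is explored anyway.

```python
# Toy workload (hypothetical): two clients each increment x non-atomically.
WORKLOAD = [
    [("A", "read"), ("A", "write")],
    [("B", "read"), ("B", "write")],
]


def interleavings(threads):
    """Yield every event order that preserves each client's program order."""
    if all(not t for t in threads):
        yield []
        return
    for i, t in enumerate(threads):
        if t:
            rest = threads[:i] + [t[1:]] + threads[i + 1:]
            for tail in interleavings(rest):
                yield [t[0]] + tail


def run(schedule):
    """Execute one schedule from the fixed initial state x = 0."""
    store = {"x": 0}
    local = {}  # value last read by each client
    for client, op in schedule:
        if op == "read":
            local[client] = store["x"]
        else:  # "write": store the previously read value plus one
            store["x"] = local[client] + 1
    return store


def check(store):
    """Correctness check: both increments must be reflected in x."""
    return store["x"] == 2


def independent(a, b):
    """Assumed independence relation: reads by different clients commute."""
    return a[0] != b[0] and a[1] == "read" and b[1] == "read"


def canonical(schedule):
    """Keep a schedule only if no adjacent independent pair is out of order."""
    return not any(
        independent(a, b) and a > b for a, b in zip(schedule, schedule[1:])
    )


def explore(threads, check, reduce=False):
    """Return (schedules explored, first failing trace or None)."""
    explored = 0
    for schedule in interleavings(threads):
        if reduce and not canonical(schedule):
            continue  # an equivalent schedule is explored instead
        explored += 1
        if not check(run(schedule)):
            return explored, schedule
    return explored, None


if __name__ == "__main__":
    explored, trace = explore(WORKLOAD, check)
    print("explored:", explored, "failing trace:", trace)
```

Passing the returned trace back to `run` replays the failing order deterministically, which mirrors the role of the replayable trace mentioned above. In this toy workload the reduction prunes two of the six interleavings.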

Published In

SSV'10: Proceedings of the 5th international conference on Systems software verification
October 2010
10 pages
Program Chairs: Ralf Huuck, Gerwin Klein, Bastian Schlich

Sponsors

  • Microsoft Research
  • NICTA: National Information and Communications Technology Australia

Publisher

USENIX Association

United States

Qualifiers

  • Article

Cited By

  • (2024) An Empirical Study on Kubernetes Operator Bugs. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 1746-1758. DOI: 10.1145/3650212.3680396. Online publication date: 11-Sep-2024
  • (2019) DFix: automatically fixing timing bugs in distributed systems. Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 994-1009. DOI: 10.1145/3314221.3314620. Online publication date: 8-Jun-2019
  • (2019) FlyMC. Proceedings of the Fourteenth EuroSys Conference 2019, pp. 1-16. DOI: 10.1145/3302424.3303986. Online publication date: 25-Mar-2019
  • (2018) FCatch. ACM SIGPLAN Notices 53:2, pp. 419-431. DOI: 10.1145/3296957.3177161. Online publication date: 19-Mar-2018
  • (2018) CloudRaid: hunting concurrency bugs in the cloud via log-mining. Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 3-14. DOI: 10.1145/3236024.3236071. Online publication date: 26-Oct-2018
  • (2018) FCatch. Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 419-431. DOI: 10.1145/3173162.3177161. Online publication date: 19-Mar-2018
  • (2017) DCatch. ACM SIGARCH Computer Architecture News 45:1, pp. 677-691. DOI: 10.1145/3093337.3037735. Online publication date: 4-Apr-2017
  • (2017) DCatch. ACM SIGPLAN Notices 52:4, pp. 677-691. DOI: 10.1145/3093336.3037735. Online publication date: 4-Apr-2017
  • (2017) DCatch. Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 677-691. DOI: 10.1145/3037697.3037735. Online publication date: 4-Apr-2017
  • (2016) Minimizing faulty executions of distributed systems. Proceedings of the 13th Usenix Conference on Networked Systems Design and Implementation, pp. 291-309. DOI: 10.5555/2930611.2930631. Online publication date: 16-Mar-2016
