DOI: 10.5555/3291168.3291208
Article

Sledgehammer: cluster-fueled debugging

Published: 08 October 2018

Abstract

Current debugging tools force developers to choose between power and interactivity. Interactive debuggers such as gdb let them quickly inspect application state and monitor execution, which is perfect for simple bugs. However, they are not powerful enough for complex bugs such as wild stores and synchronization errors where developers do not know which values to inspect or when to monitor the execution. So, developers add logging, insert timing measurements, and create functions that verify invariants. Then, they re-run applications with this instrumentation. These powerful tools are, unfortunately, not interactive; they can take minutes or hours to answer one question about a complex execution, and debugging involves asking and answering many such questions.
In this paper, we propose cluster-fueled debugging, which provides interactivity for powerful debugging tools by parallelizing their work across many cores in a cluster. At sufficient scale, developers can get answers to even detailed queries in a few seconds. Sledgehammer is a cluster-fueled debugger: it improves performance by timeslicing program execution, debug instrumentation, and analysis of results, and then executing each chunk of work on a separate core. Sledgehammer enables powerful, interactive debugging tools that are infeasible today. Parallel retro-logging allows developers to change their logging instrumentation and then quickly see what the new logging would have produced on a previous execution. Continuous function evaluation logically evaluates a function such as a data-structure integrity check at every point in a program's execution. Retro-timing allows fast performance analysis of a previous execution. With 1024 cores, Sledgehammer executes these tools hundreds of times faster than single-core execution while returning identical results.
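
To make the core idea concrete, the sketch below illustrates (in simplified form, and not Sledgehammer's actual code or API) how a cluster-fueled query such as continuous function evaluation can be parallelized: a recorded execution is split into epochs (timeslices), each epoch is replayed and checked by a separate worker, and the per-epoch results are merged back in execution order so the answer matches a sequential run. The replay_epoch and check_invariant helpers here are hypothetical stand-ins for the record/replay layer and the developer-supplied check.

```python
# Illustrative sketch only, not the Sledgehammer implementation: evaluate a
# developer-supplied invariant check over every state of a recorded execution
# by splitting the execution into epochs and checking each epoch in parallel.
from multiprocessing import Pool

STATES_PER_EPOCH = 4  # toy value: states observed per timeslice

def replay_epoch(epoch_id):
    """Hypothetical stand-in: deterministically replay one timeslice of the
    recorded execution and return the program states it passes through."""
    return [{"epoch": epoch_id, "step": s,
             "counter": epoch_id * STATES_PER_EPOCH + s}
            for s in range(STATES_PER_EPOCH)]

def check_invariant(state):
    """Hypothetical developer-supplied check (e.g. a data-structure integrity
    test); returns a list of violation messages found in this state."""
    if state["counter"] < 0:
        return ["negative counter at epoch %d, step %d"
                % (state["epoch"], state["step"])]
    return []

def evaluate_epoch(epoch_id):
    """Logically evaluate the check at every state within one epoch."""
    violations = []
    for state in replay_epoch(epoch_id):
        violations.extend(check_invariant(state))
    return violations

def cluster_fueled_query(num_epochs, num_workers):
    """Fan the epochs out across workers, then concatenate results in epoch
    order so the output is identical to a single-core, sequential pass."""
    with Pool(processes=num_workers) as pool:
        per_epoch = pool.map(evaluate_epoch, range(num_epochs))
    return [v for epoch in per_epoch for v in epoch]

if __name__ == "__main__":
    # With the toy recording above, no violations are expected.
    print(cluster_fueled_query(num_epochs=8, num_workers=4))
```

In the real system the replay layer reproduces the original execution deterministically and the workers run on separate cluster nodes, but the order-preserving merge is what lets a parallel query return results identical to sequential execution.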


Cited By

  • You can't debug what you can't see. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS 2019), pp. 163-169. https://doi.org/10.1145/3317550.3321428. Online publication date: 13 May 2019.


Published In

OSDI'18: Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation
October 2018
815 pages
ISBN:9781931971478

Sponsors

  • NetApp
  • Google Inc.
  • NSF
  • Microsoft
  • Facebook


Publisher

USENIX Association

United States

Publication History

Published: 08 October 2018


Qualifiers

  • Article
