Failure sketching: a technique for automated root cause diagnosis of in-production failures

Authors:

Benjamin Schubert,

Cristiano Pereira,

George Candea

SOSP '15: Proceedings of the 25th Symposium on Operating Systems Principles

Pages 344 - 360

https://doi.org/10.1145/2815400.2815412

Published: 04 October 2015

Abstract

Developers spend a lot of time searching for the root causes of software failures. For this, they traditionally try to reproduce those failures, but unfortunately many failures are so hard to reproduce in a test environment that developers spend days or weeks as ad-hoc detectives. The shortcomings of many solutions proposed for this problem prevent their use in practice.

We propose failure sketching, an automated debugging technique that provides developers with an explanation ("failure sketch") of the root cause of a failure that occurred in production. A failure sketch only contains program statements that lead to the failure, and it clearly shows the differences between failing and successful runs; these differences guide developers to the root cause. Our approach combines static program analysis with a cooperative and adaptive form of dynamic program analysis.
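The idea of contrasting failing and successful runs can be illustrated with a minimal sketch in Python. This is not Gist's algorithm: the trace format (a list of statement labels per run), the helper names `ordered_pairs` and `rank_differences`, and the ranking rule (difference between the fraction of failing and successful runs that contain an ordering) are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def ordered_pairs(run):
    """Return the set of (a, b) pairs where statement a ran before b.

    `run` is a list of statement labels in execution order
    (a hypothetical trace format, not one defined by the paper).
    """
    # combinations() preserves input order, so (a, b) means "a before b".
    return set(combinations(run, 2))


def rank_differences(failing_runs, successful_runs):
    """Rank orderings by how specific they are to failing runs.

    Returns (pair, failing_fraction, successful_fraction) tuples, with the
    orderings most specific to failing runs first.
    """
    fail_counts = Counter(p for run in failing_runs for p in ordered_pairs(run))
    ok_counts = Counter(p for run in successful_runs for p in ordered_pairs(run))
    n_fail = len(failing_runs)
    n_ok = len(successful_runs)

    rows = []
    for pair, count in fail_counts.items():
        fail_frac = count / n_fail
        ok_frac = ok_counts[pair] / n_ok if n_ok else 0.0
        rows.append((pair, fail_frac, ok_frac))
    # Largest failing-vs-successful gap first; ties broken by pair for stability.
    rows.sort(key=lambda r: (-(r[1] - r[2]), r[0]))
    return rows


# Hypothetical traces: the buffer is freed before use only in the failing runs.
failing = [
    ["T1: init(buf)", "T2: free(buf)", "T1: use(buf)"],
    ["T1: init(buf)", "T2: free(buf)", "T1: use(buf)"],
]
successful = [
    ["T1: init(buf)", "T1: use(buf)", "T2: free(buf)"],
]

if __name__ == "__main__":
    for pair, fail_frac, ok_frac in rank_differences(failing, successful):
        print(pair, fail_frac, ok_frac)
```

In this toy input, the ordering of `free(buf)` before `use(buf)` is the only one that separates the failing runs from the successful one, which is the kind of difference a failure sketch is meant to surface for the developer.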

We built Gist, a prototype for failure sketching that relies on hardware watchpoints and a new hardware feature for extracting control flow traces (Intel Processor Trace). We show that Gist can build failure sketches with low overhead for failures in systems like Apache, SQLite, and Memcached.

Published In

SOSP '15: Proceedings of the 25th Symposium on Operating Systems Principles

October 2015

499 pages

ISBN: 9781450338349

DOI: 10.1145/2815400

General Chair: Ethan Miller, UC Santa Cruz
Program Chair: Steven Hand, Google

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SSRC: Storage Systems Research Center, UC Santa Cruz
SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

European Research Council

Conference

SOSP '15: ACM SIGOPS 25th Symposium on Operating Systems Principles

October 4 - 7, 2015

Monterey, California

Acceptance Rates

SOSP '15 Paper Acceptance Rate: 30 of 181 submissions, 17%

Overall Acceptance Rate: 131 of 716 submissions, 18%
