DOI: 10.1145/2815400.2815412
research-article

Failure sketching: a technique for automated root cause diagnosis of in-production failures

Published: 04 October 2015 Publication History

Abstract

Developers spend a lot of time searching for the root causes of software failures. For this, they traditionally try to reproduce those failures, but unfortunately many failures are so hard to reproduce in a test environment that developers spend days or weeks as ad-hoc detectives. The shortcomings of many solutions proposed for this problem prevent their use in practice.
We propose failure sketching, an automated debugging technique that provides developers with an explanation ("failure sketch") of the root cause of a failure that occurred in production. A failure sketch only contains program statements that lead to the failure, and it clearly shows the differences between failing and successful runs; these differences guide developers to the root cause. Our approach combines static program analysis with a cooperative and adaptive form of dynamic program analysis.
We built Gist, a prototype for failure sketching that relies on hardware watchpoints and a new hardware feature for extracting control flow traces (Intel Processor Trace). We show that Gist can build failure sketches with low overhead for failures in systems like Apache, SQLite, and Memcached.
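
To make the idea of "differences between failing and successful runs" concrete, here is a minimal sketch. It is not Gist's algorithm: the function name `failure_sketch`, the representation of a run as a set of statement ids, and the scoring rule (fraction of failing runs minus fraction of successful runs) are all assumptions made for illustration. It also ignores statement ordering and thread interleavings, which the control flow traces and watchpoints mentioned above would capture.

```python
from collections import Counter


def failure_sketch(slice_stmts, failing_runs, successful_runs):
    """Rank statements from a (hypothetical) static slice by how much more
    often they execute in failing runs than in successful ones.

    slice_stmts:      set of statement ids kept by static analysis
    failing_runs:     list of sets of statement ids executed in each failing run
    successful_runs:  list of sets of statement ids executed in each successful run
    """
    # Count, per statement, how many runs of each kind executed it,
    # restricted to the statements the static slice considers relevant.
    fail_counts = Counter(s for run in failing_runs for s in run & slice_stmts)
    ok_counts = Counter(s for run in successful_runs for s in run & slice_stmts)

    sketch = []
    # Only statements seen in at least one failing run enter the sketch.
    for stmt in fail_counts:
        fail_ratio = fail_counts[stmt] / len(failing_runs)
        ok_ratio = ok_counts[stmt] / len(successful_runs) if successful_runs else 0.0
        sketch.append((stmt, fail_ratio - ok_ratio))

    # Largest difference first; ties broken by statement id for determinism.
    sketch.sort(key=lambda item: (-item[1], item[0]))
    return sketch


if __name__ == "__main__":
    slice_stmts = {"s1", "s2", "s3"}
    failing = [{"s1", "s2", "s9"}, {"s1", "s2"}]
    successful = [{"s1", "s3"}, {"s1"}]
    # "s9" is outside the slice and "s3" never runs in a failing run,
    # so neither appears in the result.
    print(failure_sketch(slice_stmts, failing, successful))
```

In this toy input, `s2` runs in every failing run and in no successful run, so it ranks first, while `s1` runs everywhere and scores zero. A real system would gather the run data cooperatively from many production executions rather than from hand-written sets.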

Supplementary Material

MP4 File (p344.mp4)

Information & Contributors

Information

Published In

SOSP '15: Proceedings of the 25th Symposium on Operating Systems Principles
October 2015
499 pages
ISBN:9781450338349
DOI:10.1145/2815400
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Acceptance Rates

SOSP '15 Paper Acceptance Rate 30 of 181 submissions, 17%;
Overall Acceptance Rate 131 of 716 submissions, 18%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 98
  • Downloads (Last 6 weeks): 7
Reflects downloads up to 09 Nov 2024

Cited By

  • (2024) Autonomic Testing: Testing with Scenarios from Production. Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings. DOI: 10.1145/3639478.3639802 (156-158). Online publication date: 14-Apr-2024
  • (2024) A systematic mapping study of bug reproduction and localization. Information and Software Technology. DOI: 10.1016/j.infsof.2023.107338, 165:C. Online publication date: 1-Jan-2024
  • (2023) Understanding Persistent-memory-related Issues in the Linux Kernel. ACM Transactions on Storage. DOI: 10.1145/3605946, 19:4 (1-28). Online publication date: 3-Oct-2023
  • (2023) Alligator in Vest: A Practical Failure-Diagnosis Framework via Arm Hardware Features. Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. DOI: 10.1145/3597926.3598106 (917-928). Online publication date: 12-Jul-2023
  • (2023) Diagnosing Kernel Concurrency Failures with AITIA. Proceedings of the Eighteenth European Conference on Computer Systems. DOI: 10.1145/3552326.3567486 (94-110). Online publication date: 8-May-2023
  • (2023) Examining Zero-Shot Vulnerability Repair with Large Language Models. 2023 IEEE Symposium on Security and Privacy (SP). DOI: 10.1109/SP46215.2023.10179420 (2339-2356). Online publication date: May-2023
  • (2023) Examining Zero-Shot Vulnerability Repair with Large Language Models. 2023 IEEE Symposium on Security and Privacy (SP). DOI: 10.1109/SP46215.2023.10179324 (2339-2356). Online publication date: May-2023
  • (2022) PalanTír. Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security. DOI: 10.1145/3548606.3560570 (3135-3149). Online publication date: 7-Nov-2022
  • (2022) A deep study of the effects and fixes of server-side request races in web applications. Proceedings of the 19th International Conference on Mining Software Repositories. DOI: 10.1145/3524842.3528463 (744-756). Online publication date: 23-May-2022
  • (2022) Debugging in the brave new world of reconfigurable hardware. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. DOI: 10.1145/3503222.3507701 (946-962). Online publication date: 28-Feb-2022
