
SherLog: error diagnosis by connecting clues from run-time logs

Published: 13 March 2010

Abstract

Computer systems often fail due to many factors such as software bugs or administrator errors. Diagnosing such production run failures is an important but challenging task, since they are difficult to reproduce in house for several reasons: (1) users' inputs and file contents are unavailable due to privacy concerns; (2) the exact same execution environment is difficult to build; and (3) concurrent executions on multi-processors are non-deterministic.
Therefore, programmers often have to diagnose a production run failure based on logs collected back from customers and the corresponding source code. Such diagnosis requires expert knowledge, and narrowing down root causes is time-consuming and tedious. To address this problem, we propose a tool, called SherLog, that analyzes source code by leveraging information provided by run-time logs to infer what must or may have happened during the failed production run. It requires neither re-execution of the program nor knowledge of the log's semantics. It infers both control and data value information regarding the failed execution.
We evaluate SherLog with 8 representative real-world software failures (6 software bugs and 2 configuration errors) from 7 applications, including 3 servers. The information inferred by SherLog is very useful for programmers diagnosing these failures. Our results also show that SherLog can analyze large server applications such as Apache, with thousands of logging messages, within only 40 minutes.
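
The idea of connecting log messages back to source code, and of separating what "must" from what "may" have happened, can be illustrated with a small sketch. This is not SherLog's implementation, which the abstract does not detail. It is a toy under stated assumptions: the logging statements, their format strings, the event names, and the hand-written candidate paths are all hypothetical, and a real tool would derive the paths by analyzing the program's source.

```python
import re

# Hypothetical logging statements (not from the paper): id -> printf-style format.
LOG_STATEMENTS = {
    "open_fail": "cannot open %s: errno %d",
    "retry": "retrying, attempt %d",
    "give_up": "giving up after %d attempts",
}


def format_to_regex(fmt):
    """Compile a printf-style format (only %s and %d) into an anchored regex."""
    pattern = ""
    for part in re.split(r"(%[sd])", fmt):
        if part == "%s":
            pattern += r"(\S+)"        # one capture group per string argument
        elif part == "%d":
            pattern += r"(-?\d+)"      # one capture group per integer argument
        else:
            pattern += re.escape(part)  # literal text of the message
    return re.compile("^" + pattern + "$")


def match_line(line):
    """Return (statement_id, captured_values) for a log line, or None."""
    for stmt_id, fmt in LOG_STATEMENTS.items():
        m = format_to_regex(fmt).match(line)
        if m:
            return stmt_id, m.groups()
    return None


def must_and_may(observed, candidate_paths):
    """Classify events on the paths whose logged events equal the observed log."""
    consistent = [
        set(path) for path in candidate_paths
        if [e for e in path if e in LOG_STATEMENTS] == observed
    ]
    if not consistent:
        return [], []
    must = set.intersection(*consistent)       # on every consistent path
    may = set.union(*consistent) - must         # on some consistent paths only
    return sorted(must), sorted(may)


log = ["cannot open /etc/app.conf: errno 2", "retrying, attempt 1"]
observed = [match_line(line)[0] for line in log]

# Hypothetical candidate paths; unlogged events (e.g. "abort") leave no message.
paths = [
    ["read_config", "open_fail", "retry", "use_defaults"],
    ["read_config", "open_fail", "retry", "abort"],
    ["read_config", "open_fail", "give_up"],
]
must, may = must_and_may(observed, paths)
print("must:", must)
print("may:", may)
```

In this toy, the third path is ruled out because it would have printed a "giving up" message that the log does not contain. Events shared by the two remaining paths are reported as "must", and the events on which they differ are reported as "may". The captured groups from `match_line` stand in for the data value information that a log message reveals.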

    Published In

    ACM SIGARCH Computer Architecture News, Volume 38, Issue 1
    ASPLOS '10
    March 2010
    399 pages
    ISSN:0163-5964
    DOI:10.1145/1735970
      ASPLOS XV: Proceedings of the fifteenth International Conference on Architectural support for programming languages and operating systems
      March 2010
      422 pages
      ISBN:9781605588391
      DOI:10.1145/1736020
      • General Chair: James C. Hoe
      • Program Chair: Vikram S. Adve
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 March 2010
    Published in SIGARCH Volume 38, Issue 1

    Author Tags

    1. failure diagnostics
    2. log
    3. static analysis

    Qualifiers

    • Research-article

    Article Metrics

    • Downloads (last 12 months): 118
    • Downloads (last 6 weeks): 17
    Reflects downloads up to 25 Jan 2025

    Cited By

    • (2022) Self-Adaptive Root Cause Diagnosis for Large-Scale Microservice Architecture. IEEE Transactions on Services Computing, 15:3 (1399-1410). DOI: 10.1109/TSC.2020.2993251. Online publication date: 1-May-2022
    • (2022) QoS-Aware Co-Scheduling for Distributed Long-Running Applications on Shared Clusters. IEEE Transactions on Parallel and Distributed Systems, 33:12 (4818-4834). DOI: 10.1109/TPDS.2022.3202493. Online publication date: 1-Dec-2022
    • (2021) AppAngio: Revealing Contextual Information of Android App Behaviors by API-Level Audit Logs. IEEE Transactions on Information Forensics and Security, 16 (1912-1927). DOI: 10.1109/TIFS.2020.3044867. Online publication date: 8-Jan-2021
    • (2021) LogFlash: Real-time Streaming Anomaly Detection and Diagnosis from System Logs for Large-scale Software Systems. 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE) (80-90). DOI: 10.1109/ISSRE52982.2021.00021. Online publication date: Oct-2021
    • (2021) Identifying Anomaly Detection Patterns from Log Files: A Dynamic Approach. Computational Science and Its Applications – ICCSA 2021 (517-532). DOI: 10.1007/978-3-030-86960-1_36. Online publication date: 13-Sep-2021
    • (2021) sBiLSAN: Stacked Bidirectional Self-attention LSTM Network for Anomaly Detection and Diagnosis from System Logs. Intelligent Systems and Applications (777-793). DOI: 10.1007/978-3-030-82199-9_52. Online publication date: 7-Aug-2021
    • (2021) A Systematic Mapping Study in AIOps. Service-Oriented Computing – ICSOC 2020 Workshops (110-123). DOI: 10.1007/978-3-030-76352-7_15. Online publication date: 30-May-2021
    • (2020) Improving Logging Prediction on Imbalanced Datasets. Cognitive Analytics (740-772). DOI: 10.4018/978-1-7998-2460-2.ch039. Online publication date: 2020
    • (2020) Logging Inter-Thread Data Dependencies in Linux Kernel. IEICE Transactions on Information and Systems, E103.D:7 (1633-1646). DOI: 10.1587/transinf.2019EDP7255. Online publication date: 1-Jul-2020
    • (2020) Improving Fault-Localization Accuracy by Referencing Debugging History to Alleviate Structure Bias in Code Suspiciousness. IEEE Transactions on Reliability, 69:3 (1021-1049). DOI: 10.1109/TR.2020.2982975. Online publication date: Sep-2020