SherLog: error diagnosis by connecting clues from run-time logs

Authors:

Shankar Pasupathy

ACM SIGPLAN Notices, Volume 45, Issue 3

Pages 143 - 154

https://doi.org/10.1145/1735971.1736038

Published: 13 March 2010

Abstract

Computer systems often fail due to many factors, such as software bugs or administrator errors. Diagnosing such production-run failures is an important but challenging task, since they are difficult to reproduce in-house for several reasons: (1) unavailability of users' inputs and file content due to privacy concerns; (2) difficulty in building the exact same execution environment; and (3) non-determinism of concurrent executions on multi-processors.

Therefore, programmers often have to diagnose a production-run failure based on logs collected from customers and the corresponding source code. Such diagnosis requires expert knowledge, and narrowing down root causes is time-consuming and tedious. To address this problem, we propose a tool, called SherLog, that analyzes source code by leveraging information provided by run-time logs to infer what must or may have happened during the failed production run. It requires neither re-execution of the program nor knowledge of the logs' semantics. It infers both control and data value information regarding the failed execution.
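To make the idea of connecting log messages back to source code concrete, the following is a minimal sketch and not SherLog's actual implementation. It assumes a hypothetical table of logging statements (`LOG_STATEMENTS`, with made-up source locations and printf-style format strings), turns each format string into a regular expression, and reports which statements may have printed each log line, together with the data values recovered from it. Only the `%s` and `%d` specifiers are handled.

```python
import re

# Hypothetical logging statements: (source location, printf-style format).
# These names and locations are illustrative assumptions, not from the paper.
LOG_STATEMENTS = [
    ("ftp.c:120", "connecting to %s port %d"),
    ("ftp.c:245", "cannot open file %s"),
    ("cache.c:77", "evicted %d entries"),
]

# Regex fragment for each supported format specifier.
_SPEC = {"%s": r"(.+?)", "%d": r"(-?\d+)"}


def format_to_regex(fmt):
    """Turn a printf-style format string into an anchored regex."""
    parts = re.split(r"(%[sd])", fmt)
    pattern = "".join(_SPEC.get(p, re.escape(p)) for p in parts)
    return re.compile("^" + pattern + "$")


def match_log(lines):
    """For each log line, list every statement that may have printed it,
    along with the values captured for its format specifiers."""
    compiled = [(loc, format_to_regex(fmt)) for loc, fmt in LOG_STATEMENTS]
    result = []
    for line in lines:
        candidates = []
        for loc, rx in compiled:
            m = rx.match(line)
            if m:
                candidates.append((loc, m.groups()))
        result.append(candidates)
    return result


if __name__ == "__main__":
    log = ["connecting to example.org port 21", "cannot open file icon.gif"]
    for line, candidates in zip(log, match_log(log)):
        print(line, "->", candidates)
```

A line with exactly one candidate pins down a statement that must have executed, while several candidates leave only a "may have executed" conclusion. A real analysis would then have to reason about the paths between the matched statements, which this sketch does not attempt.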

We evaluate SherLog with eight representative real-world software failures (six software bugs and two configuration errors) from seven applications, including three servers. The information inferred by SherLog is very useful for programmers diagnosing these failures. Our results also show that SherLog can analyze large server applications such as Apache, with thousands of logging messages, within 40 minutes.

Index Terms

SherLog: error diagnosis by connecting clues from run-time logs
1. Software and its engineering
  1. Software creation and management
    1. Software verification and validation

Published In

ACM SIGPLAN Notices Volume 45, Issue 3

ASPLOS '10

March 2010

399 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/1735971

ASPLOS XV: Proceedings of the fifteenth International Conference on Architectural support for programming languages and operating systems
March 2010
422 pages
ISBN:9781605588391
DOI:10.1145/1736020
General Chair: James C. Hoe, Carnegie Mellon University, USA
Program Chair: Vikram S. Adve, University of Illinois at Urbana-Champaign, USA

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
