SherLog: error diagnosis by connecting clues from run-time logs

Authors:

Shankar Pasupathy

ACM SIGARCH Computer Architecture News, Volume 38, Issue 1

Pages 143 - 154

https://doi.org/10.1145/1735970.1736038

Published: 13 March 2010

Abstract

Computer systems fail for many reasons, including software bugs and administrator errors. Diagnosing such production-run failures is important but challenging, because they are difficult to reproduce in house for several reasons: (1) unavailability of users' inputs and file content due to privacy concerns; (2) difficulty in building the exact same execution environment; and (3) non-determinism of concurrent executions on multi-processors.

Therefore, programmers often have to diagnose a production-run failure based on logs collected from customers and the corresponding source code. Such diagnosis requires expert knowledge, and narrowing down root causes is time-consuming and tedious. To address this problem, we propose a tool, called SherLog, that analyzes source code by leveraging information provided by run-time logs to infer what must or may have happened during the failed production run. It requires neither re-execution of the program nor knowledge of the log's semantics. It infers both control and data value information about the failed execution.

We evaluate SherLog on 8 representative real-world software failures (6 software bugs and 2 configuration errors) from 7 applications, including 3 servers. The information inferred by SherLog is very useful for programmers diagnosing these failures. Our results also show that SherLog can analyze large server applications such as Apache, with thousands of logging messages, within 40 minutes.
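As a toy illustration of the idea of connecting log messages back to source code, the sketch below matches run-time log lines against the format strings of logging statements, recovering which statement could have printed each line (control information) and the values it printed (data information). This is not SherLog's implementation, which the abstract does not describe in detail; the `LOG_POINTS` table, the source locations, the message formats, and the helper names are all hypothetical, and only the `%s` and `%d` conversions are handled.

```python
import re

# Hypothetical logging statements: source location -> printf-style format string.
# A real tool would extract these from the program's source code.
LOG_POINTS = {
    "copy.c:42": "cannot open %s: error %d",
    "copy.c:87": "copied %d bytes to %s",
}

# Regex fragment for each supported conversion (assumption: only %s and %d).
SPECIFIERS = {"%s": r"(\S+)", "%d": r"(-?\d+)"}


def format_to_regex(fmt):
    """Turn a printf-style format string into a compiled, anchored regex."""
    # Splitting with a capturing group keeps the conversions in the result list.
    parts = re.split(r"(%[sd])", fmt)
    # Conversions become capture groups; literal text is escaped verbatim.
    pattern = "".join(SPECIFIERS.get(p, re.escape(p)) for p in parts)
    return re.compile("^" + pattern + "$")


def match_log_line(line):
    """Return (location, captured values) for each log point that could have printed `line`."""
    hits = []
    for location, fmt in LOG_POINTS.items():
        m = format_to_regex(fmt).match(line)
        if m:
            hits.append((location, m.groups()))
    return hits


if __name__ == "__main__":
    for entry in ["cannot open /tmp/a.txt: error 13", "copied 512 bytes to /dst/file"]:
        print(entry, "->", match_log_line(entry))
```

A line that matches exactly one log point shows that the statement must have executed; a line that matches several only shows that one of them may have. Inferring the execution paths between matched statements would require program analysis well beyond this sketch.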

Index Terms

SherLog: error diagnosis by connecting clues from run-time logs
1. Software and its engineering
  1. Software creation and management
    1. Software verification and validation

Published In

ACM SIGARCH Computer Architecture News Volume 38, Issue 1

ASPLOS '10

March 2010

399 pages

ISSN:0163-5964

DOI:10.1145/1735970

ASPLOS XV: Proceedings of the fifteenth International Conference on Architectural support for programming languages and operating systems
March 2010
422 pages
ISBN:9781605588391
DOI:10.1145/1736020
General Chair: James C. Hoe, Carnegie Mellon University, USA
Program Chair: Vikram S. Adve, University of Illinois at Urbana-Champaign, USA

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

