DOI: 10.1145/1629575.1629586
research-article

Debugging in the (very) large: ten years of implementation and experience

Published: 11 October 2009

Abstract

Windows Error Reporting (WER) is a distributed system that automates the processing of error reports coming from an installed base of a billion machines. WER has collected billions of error reports in ten years of operation. It collects error data automatically and classifies errors into buckets, which are used to prioritize developer effort and report fixes to users. WER uses a progressive approach to data collection, which minimizes overhead for most reports yet allows developers to collect detailed information when needed. WER takes advantage of its scale to use error statistics as a tool in debugging; this allows developers to isolate bugs that could not be found at smaller scale. WER has been designed for large scale: one pair of database servers can record all the errors that occur on all Windows computers worldwide.
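
The bucketing and progressive-collection ideas in the abstract can be illustrated with a minimal sketch. It is not WER's actual algorithm: the report fields (`app`, `version`, `module`, `offset`), the exact-match bucket key, and the `needed` threshold are all hypothetical, chosen only to show how counting reports per bucket yields a priority order and how detailed data might be requested only while a bucket still lacks samples.

```python
from collections import Counter
from typing import NamedTuple


class ErrorReport(NamedTuple):
    # Hypothetical fields for illustration; not the real WER report format.
    app: str
    version: str
    module: str
    offset: int


def bucket_key(report: ErrorReport) -> tuple:
    """Map a report to a bucket label (assumed scheme: every field must match)."""
    return (report.app, report.version, report.module, report.offset)


def rank_buckets(reports):
    """Count reports per bucket; return (bucket, hits) pairs, most hits first."""
    counts = Counter(bucket_key(r) for r in reports)
    return counts.most_common()


def needs_detail(detailed_samples: int, needed: int = 3) -> bool:
    """Progressive collection (assumed policy): ask a client for a detailed
    dump only while the bucket holds fewer than `needed` detailed samples."""
    return detailed_samples < needed


reports = [
    ErrorReport("editor.exe", "1.0", "render.dll", 0x1A2B),
    ErrorReport("editor.exe", "1.0", "render.dll", 0x1A2B),
    ErrorReport("editor.exe", "1.0", "io.dll", 0x0040),
    ErrorReport("player.exe", "2.3", "codec.dll", 0x0F00),
    ErrorReport("editor.exe", "1.0", "render.dll", 0x1A2B),
]

ranked = rank_buckets(reports)
for bucket, hits in ranked:
    print(hits, bucket)
```

Under these assumptions the `render.dll` bucket ranks first with three hits, so it would be the first candidate for developer attention. Most clients would send only the cheap bucket label, and `needs_detail` would stop requesting full dumps once a bucket has enough of them.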


Published In

SOSP '09: Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
October 2009
346 pages
ISBN: 9781605587523
DOI: 10.1145/1629575
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. blue screen of death
  2. bucketing
  3. classifying
  4. error reports
  5. labeling
  6. minidump
  7. statistics-based debugging

Qualifiers

  • Research-article

Conference

SOSP09
Acceptance Rates

Overall Acceptance Rate 174 of 961 submissions, 18%

Cited By

  • (2025) Architecture 2.0: Foundations of Artificial Intelligence Agents for Modern Computer System Design. Computer 58:2 (116-124). DOI: 10.1109/MC.2024.3521641. Online publication date: Feb-2025
  • (2024) Demystifying the Fight Against Complexity: A Comprehensive Study of Live Debugging Activities in Production Cloud Systems. Proceedings of the 2024 ACM Symposium on Cloud Computing (341-360). DOI: 10.1145/3698038.3698568. Online publication date: 20-Nov-2024
  • (2024) PP-CSA: Practical Privacy-Preserving Software Call Stack Analysis. Proceedings of the ACM on Programming Languages 8:OOPSLA1 (1264-1293). DOI: 10.1145/3649856. Online publication date: 29-Apr-2024
  • (2024) Measurement‐Based Analysis of Large‐Scale Clusters. Dependable Computing (585-665). DOI: 10.1002/9781119743453.ch12. Online publication date: 26-Apr-2024
  • (2023) Alligator in Vest: A Practical Failure-Diagnosis Framework via Arm Hardware Features. Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (917-928). DOI: 10.1145/3597926.3598106. Online publication date: 12-Jul-2023
  • (2023) Hacksaw: Hardware-Centric Kernel Debloating via Device Inventory and Dependency Analysis. Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (1994-2008). DOI: 10.1145/3576915.3623208. Online publication date: 15-Nov-2023
  • (2023) Adaptive Tracing and Fault Injection based Fault Diagnosis for Open Source Server Software. 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security (QRS) (729-740). DOI: 10.1109/QRS60937.2023.00076. Online publication date: 22-Oct-2023
  • (2022) The case for an internet primitive for fault localization. Proceedings of the 21st ACM Workshop on Hot Topics in Networks (160-166). DOI: 10.1145/3563766.3564105. Online publication date: 14-Nov-2022
  • (2022) DeepAnalyze. Proceedings of the 44th International Conference on Software Engineering (549-560). DOI: 10.1145/3510003.3512759. Online publication date: 21-May-2022
  • (2022) BuildSheriff. Proceedings of the 44th International Conference on Software Engineering (312-324). DOI: 10.1145/3510003.3510132. Online publication date: 21-May-2022