
    Sean Blanchard

    In this work we present SaNSA, the Supercomputer and Node State Architecture, a software infrastructure for historical analysis and anomaly detection. SaNSA consumes data from multiple sources including system logs, the resource manager, scheduler, and job logs. Furthermore, additional context such as scheduled maintenance events or dedicated application run times for specific science teams can be overlaid. We discuss how this contextual information allows for more nuanced analysis. SaNSA allows the user to apply arbitrary attributes, for instance, positional information about where nodes are located in a data center. We show how this information is used to identify anomalous behavior in one rack of a 1,500-node cluster. We explain the design of SaNSA and then test it on four open compute clusters at LANL. We ingest over 1.1 billion lines of system logs in our study of 190 days in 2018. Using SaNSA, we perform a number of different anomaly detection methods and explain their findings in the context of a production supercomputing data center. For example, we report on instances of misconfigured nodes which receive no scheduled jobs for a period of time, as well as examples of correlated rack failures which cause jobs to crash.
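    The abstract above does not give implementation details, so the sketch below is only a rough illustration of the kind of per-node state bookkeeping a system like SaNSA performs. Field names, states, and data are hypothetical, not the actual SaNSA schema: state intervals carry arbitrary attributes (here, rack position), and idle time is tallied per rack to surface a rack that receives no scheduled jobs.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class StateInterval:
    node: str
    state: str          # e.g. "scheduled", "idle", "down", "maintenance"
    start: float        # epoch seconds
    end: float
    attrs: dict = field(default_factory=dict)   # arbitrary attributes, e.g. {"rack": "R07"}

def idle_hours_by_rack(intervals):
    """Aggregate time spent in the 'idle' state per rack.

    Racks that accumulate far more idle time than their peers are candidates
    for the misconfigured-node anomaly described in the abstract.
    """
    totals = defaultdict(float)
    for iv in intervals:
        if iv.state == "idle":
            totals[iv.attrs.get("rack", "unknown")] += (iv.end - iv.start) / 3600.0
    return dict(totals)

# Toy data: rack R07 receives no scheduled jobs for a full day.
intervals = [
    StateInterval("n0101", "scheduled", 0, 86_400, {"rack": "R01"}),
    StateInterval("n0701", "idle",      0, 86_400, {"rack": "R07"}),
    StateInterval("n0702", "idle",      0, 86_400, {"rack": "R07"}),
]
print(idle_hours_by_rack(intervals))   # -> {'R07': 48.0}
```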
    As the high performance computing (HPC) community continues to push for ever larger machines, reliability remains a serious obstacle. Further, as feature sizes and voltages decrease, the rate of transient soft errors is on the rise. HPC programmers today have to deal with these faults only to a small degree, but it is expected that this will become a larger problem as systems continue to scale. In this paper we present SEFI, the Soft Error Fault Injection framework, a tool for profiling software for its susceptibility to soft errors. In particular, we focus in this paper on logic soft error injection. Using the open source virtual machine and processor emulator QEMU, we demonstrate modifying emulated machine instructions to introduce soft errors. We conduct experiments by modifying the virtual machine itself in a way that does not require intimate knowledge of the tested application. With this technique, we show that we are able to inject simulated soft errors into the logic operations of a target application without affecting other applications or the operating system sharing the VM. We present some initial results and discuss where we think this work will be useful in next generation hardware/software co-design.
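    SEFI itself operates inside QEMU on emulated machine instructions; the fragment below is only a language-level illustration of the same basic idea of logic soft-error injection, not the QEMU integration: intercept an arithmetic operation and, with some probability, flip one randomly chosen bit of its result.

```python
import random

def faulty_add(a: int, b: int, width: int = 64, p_flip: float = 1e-6) -> int:
    """Emulate an integer add whose result may suffer a single-bit soft error.

    With probability p_flip, one randomly chosen bit of the width-bit result
    is inverted, mimicking a transient logic error in the ALU.
    """
    result = (a + b) & ((1 << width) - 1)
    if random.random() < p_flip:
        result ^= 1 << random.randrange(width)
    return result

# Force an injection to see the effect (p_flip=1.0 guarantees a bit flip).
random.seed(0)
print(faulty_add(2, 3, p_flip=1.0))   # a value differing from 5 in exactly one bit
```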
    Cloud computing has become increasingly popular by obviating the need for users to own and maintain complex computing infrastructures. However, due to their inherent complexity and large scale, production cloud computing systems are prone to various runtime problems caused by hardware and software faults and environmental factors. Autonomic anomaly detection is crucial for understanding emergent, cloud-wide phenomena and self-managing cloud resources for system-level dependability assurance. To detect anomalous cloud behaviors, we need to monitor the cloud execution and collect runtime cloud performance data. For different types of failures, the data display different correlations with the performance metrics. In this paper, we present a wavelet-based multi-scale anomaly identification mechanism that can analyze profiled cloud performance metrics in both the time and frequency domains and identify anomalous cloud behaviors. Learning technologies are exploited to adapt the selection of mother wavelets, and a sliding detection window is employed to handle cloud dynamicity and improve anomaly detection accuracy. We have implemented a prototype of the anomaly identification system and conducted experiments in an on-campus cloud computing environment. Experimental results show the proposed mechanism can achieve 93.3% detection sensitivity while keeping the false positive rate as low as 6.1%, outperforming the other anomaly detection schemes tested.
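    The paper's mechanism adapts its mother wavelets and detection window; the numpy-only sketch below conveys just the multi-scale idea with a fixed Haar decomposition: compute detail-coefficient energy over sliding windows of a performance metric and flag windows whose energy deviates strongly from the baseline. Window size, threshold, and data are placeholders.

```python
import numpy as np

def haar_detail_energy(window: np.ndarray, levels: int = 3) -> float:
    """Sum of squared Haar detail coefficients across `levels` scales."""
    x = window.astype(float)
    energy = 0.0
    for _ in range(levels):
        if len(x) < 2:
            break
        pairs = x[: len(x) // 2 * 2].reshape(-1, 2)
        detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)
        energy += float(np.sum(detail ** 2))
        x = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)   # approximation for next scale
    return energy

def flag_anomalies(metric: np.ndarray, win: int = 64, z_thresh: float = 3.0):
    """Slide a window over a performance metric and flag high-energy windows."""
    energies = np.array([haar_detail_energy(metric[i:i + win])
                         for i in range(0, len(metric) - win, win)])
    mu, sigma = energies.mean(), energies.std() + 1e-12
    return np.where((energies - mu) / sigma > z_thresh)[0]

rng = np.random.default_rng(1)
cpu_util = rng.normal(40, 2, 4096)
cpu_util[2048:2112] += rng.normal(0, 25, 64)   # injected anomalous burst
print(flag_anomalies(cpu_util))                # window index covering the burst
```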
    We discuss observed characteristics of GPUs deployed as accelerators in an HPC cluster at Los Alamos National Laboratory. GPUs have a very good theoretical FLOPS rate and are reasonably inexpensive and available, but they are relatively new to HPC, which demands both consistently high performance across nodes and a consistently low error rate.
    Fault tolerance poses a major challenge for future large-scale systems. Current research on fault tolerance has been principally focused on mitigating the impact of uncorrectable errors: errors that corrupt the state of the machine and require a restart from a known good state. However, correctable errors occur much more frequently than uncorrectable errors and may be even more common on future systems. Although an application can safely continue to execute when correctable errors occur, recovery from a correctable error requires the error to be corrected and, in most cases, information about its occurrence to be logged. The potential performance impact of these recovery activities has not been extensively studied in HPC. In this paper, we use simulation to examine the relationship between recovery from correctable errors and application performance for several important extreme-scale workloads. Our paper contains what is, to the best of our knowledge, the first detailed analysis of the impact of correctable errors on application performance. Our study shows that correctable errors can have significant impact on application performance for future systems. We also find that although current work on correctable errors focuses on reducing failure rates, reducing the time required to log individual errors may have a greater impact on overheads at scale. Finally, this study outlines the error frequency and duration targets needed to keep correctable-error overheads similar to those of today's systems. This paper provides critical analysis of and insight into the overheads of correctable errors and provides practical advice to system administrators and hardware designers in an effort to fine-tune performance to application and system characteristics.
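    The paper relies on detailed simulation of extreme-scale workloads; the snippet below is only a back-of-the-envelope sketch of the underlying overhead question: if each correctable error stalls a node for a fixed correction-plus-logging time and errors arrive as a Poisson process, what fraction of job time is lost? The rates and costs are placeholders, not values from the study.

```python
import numpy as np

def correctable_error_overhead(error_rate_per_hour: float,
                               seconds_per_error: float,
                               job_hours: float,
                               trials: int = 10_000,
                               seed: int = 0) -> float:
    """Mean fraction of job time lost to correctable-error handling."""
    rng = np.random.default_rng(seed)
    n_errors = rng.poisson(error_rate_per_hour * job_hours, size=trials)
    lost_seconds = n_errors * seconds_per_error
    return float(np.mean(lost_seconds / (job_hours * 3600.0)))

# Placeholder numbers: 100 correctable errors/hour, 50 ms to correct and log each.
print(f"overhead ~ {correctable_error_overhead(100, 0.05, 24):.4%}")
```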
    System logs provide invaluable resources for understanding system behavior and detecting anomalies on high performance computing (HPC) systems. As HPC systems continue to grow in both scale and complexity, the sheer volume of system logs and the complex interaction among system components make traditional manual problem diagnosis and even automated line-by-line log analysis infeasible or ineffective. In this paper, we present a System Log Event Block Detection (SLEBD) framework that identifies groups of log messages which follow a certain sequence, with variations, and explores these event blocks for event-based system behavior analysis and anomaly detection. Compared with existing approaches that analyze system logs line by line, SLEBD is capable of characterizing system behavior and identifying intricate anomalies at a higher (i.e., event) level. We evaluate the performance of SLEBD using syslogs collected from production supercomputers. Experimental results show that our framework and mechanisms can process streaming log messages, efficiently extract event blocks, and effectively detect anomalies, which enables system administrators and monitoring tools to understand and process system events in real time. Additionally, we use the identified event blocks and explore deep learning algorithms to model and classify event sequences.
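    SLEBD's block-detection algorithm is considerably more involved than this; the sketch below only illustrates the general event-block idea: collapse each syslog line to a template, cut a node's message stream into blocks at large time gaps, and flag block signatures that occur rarely. The template extraction, thresholds, and data here are simplistic placeholders.

```python
import re
from collections import Counter

def template(msg: str) -> str:
    """Crude log-template extraction: mask hex tokens and numbers."""
    return re.sub(r"0x[0-9a-fA-F]+|\d+", "<*>", msg)

def event_blocks(lines, gap_seconds=5.0):
    """Split one node's (timestamp, message) stream into blocks at time gaps."""
    blocks, current, last_ts = [], [], None
    for ts, msg in lines:
        if last_ts is not None and ts - last_ts > gap_seconds and current:
            blocks.append(tuple(current))
            current = []
        current.append(template(msg))
        last_ts = ts
    if current:
        blocks.append(tuple(current))
    return blocks

def rare_blocks(blocks, min_support=2):
    counts = Counter(blocks)
    return [b for b, c in counts.items() if c < min_support]

stream = [(0.0, "eth0 link up"), (0.1, "mount /scratch ok"),
          (60.0, "eth0 link up"), (60.1, "mount /scratch ok"),
          (120.0, "MCE bank 4: corrected error at 0x1f2e")]
print(rare_blocks(event_blocks(stream)))   # the lone MCE block is flagged as rare
```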
    System logs provide invaluable resources for understanding system behavior and detecting anomalies on high performance computing (HPC) systems. As HPC systems continue to grow in both scale and complexity, the sheer volume of system logs and the complex interaction among system components make traditional manual problem diagnosis and even automated line-by-line log analysis infeasible or ineffective. Sequence mining technologies aim to identify important patterns among a set of objects, which can help us discover regularity among events, detect anomalies, and predict events in HPC environments. Existing sequence mining algorithms are compute-intensive and inefficient at processing the overwhelming number of system events, which have complex interactions and dependencies. In this paper, we present a novel, topology-aware sequence mining method (named TSM) and explore it for event analysis and anomaly detection on production HPC systems. TSM is resource-efficient and capable of producing long and complex event patterns from log messages, which makes TSM suitable for online monitoring and diagnosis of large-scale systems. We evaluate the performance of TSM using system logs collected from a production supercomputer. Experimental results show that TSM is highly efficient at identifying event sequences on single and multiple nodes without any prior knowledge. We apply verification functions and requirements to prove the correctness of the event patterns produced by TSM.
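    TSM is topology-aware and far more sophisticated than this; the sketch below only conveys the flavor of frequent-sequence mining over an event stream: count ordered k-event patterns that occur within a short time window and keep those above a support threshold. All parameters and event names are illustrative.

```python
from collections import Counter
from itertools import combinations

def frequent_sequences(events, k=2, window=10.0, min_support=3):
    """Mine frequent ordered k-event patterns from a (timestamp, event_id) stream.

    For every event, look ahead within `window` seconds and count each ordered
    k-subsequence starting at it; keep patterns seen at least `min_support` times.
    """
    counts = Counter()
    for i, (t0, e0) in enumerate(events):
        ahead = [e for t, e in events[i + 1:] if t - t0 <= window]
        for combo in combinations(ahead, k - 1):   # combinations preserve order
            counts[(e0,) + combo] += 1
    return {seq: c for seq, c in counts.items() if c >= min_support}

# Toy stream: "link_down" is regularly followed by "link_up" then "nfs_timeout".
events = []
for base in (0, 30, 60, 90):
    events += [(base + 0.0, "link_down"), (base + 1.0, "link_up"), (base + 2.0, "nfs_timeout")]
print(frequent_sequences(events, k=3))   # {('link_down', 'link_up', 'nfs_timeout'): 4}
```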
    Soft errors are becoming an important issue in computing systems. Near-threshold voltage (NTV), reduced circuit sizes, high performance computing (HPC), and high altitude computing all present interesting challenges in this area. Much of the existing literature has focused on hardware techniques to mitigate and measure soft errors at the hardware level. Instead, in this paper we explore the soft error susceptibility of three common sorting algorithms at the software layer. We focus on the comparison operator and use our software fault injection tool to place faults with fine precision during the execution of these algorithms. We explore how the algorithms' susceptibilities vary based on input and bit position, and relate these faults back to the source code to study how algorithmic decisions impact the reliability of the codes. Finally, we look at the question of the number of fault injections required for statistical significance. Using the standard equations applied in hardware fault injection experiments, we calculate the number of injections that should be required to achieve confidence in our results. Then we show, empirically, that more fault injections are required before we gain confidence in our experiments.
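    The paper's tool injects faults at the machine level with fine timing precision; as a purely illustrative source-level analogue, the sketch below flips the outcome of one chosen comparison during an insertion sort and reports whether the output is still correct, which is one crude way of probing a sorting algorithm's sensitivity to a single corrupted comparison.

```python
def insertion_sort_with_fault(data, faulty_cmp_index=None):
    """Insertion sort in which comparison number `faulty_cmp_index` is inverted."""
    a, cmp_count = list(data), 0
    for i in range(1, len(a)):
        key, j = a[i], i - 1
        while j >= 0:
            result = a[j] > key
            if cmp_count == faulty_cmp_index:
                result = not result          # injected soft error in the comparator
            cmp_count += 1
            if not result:
                break
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a, cmp_count

data = [5, 2, 9, 1, 7, 3]
clean, total_cmps = insertion_sort_with_fault(data)
for idx in range(total_cmps):
    out, _ = insertion_sort_with_fault(data, faulty_cmp_index=idx)
    status = "ok" if out == clean else "corrupted"
    print(f"fault at comparison {idx}: {status}  {out}")
```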
    Environmental sensors monitor supercomputing facility health, generating massive amounts of data in the largest facilities. The current state of the art is for human operators to evaluate environmental data by hand. This approach will not be viable on exascale machines, nor is it ideal on current systems. We evaluate the effectiveness of the DBSCAN algorithm for identifying anomalies in supercomputing sensor data. We filter large portions of data showing normal behavior from anomalies, and then rank anomalous points by distance to the nearest normal cluster. We compare DBSCAN to k-means and Gaussian kernel density estimation, finding that DBSCAN effectively clusters sensor data from a Cray supercomputing facility. DBSCAN also successfully clusters synthetic injected data, avoiding the false positives generated by k-means and Gaussian kernel density estimation.
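    A minimal sketch of the clustering step, assuming scikit-learn is available: cluster two synthetic sensor channels with DBSCAN, treat the noise label (-1) as anomalous, and rank noise points by distance to the nearest clustered point. The epsilon, min_samples, and synthetic data below are placeholders, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
normal = rng.normal(loc=[18.0, 45.0], scale=[0.5, 2.0], size=(500, 2))   # temp, humidity
anomalies = np.array([[25.0, 45.0], [18.0, 80.0]])                        # injected outliers
X = np.vstack([normal, anomalies])

labels = DBSCAN(eps=1.5, min_samples=10).fit_predict(X)
noise = X[labels == -1]
clustered = X[labels != -1]

# Rank anomalous points by how far they sit from the nearest "normal" point.
nn = NearestNeighbors(n_neighbors=1).fit(clustered)
dist, _ = nn.kneighbors(noise)
for point, d in sorted(zip(noise.tolist(), dist.ravel()), key=lambda t: -t[1]):
    print(f"anomaly {point}: distance to normal cluster {d:.2f}")
```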
    Chipkill correct is an advanced type of error correction used in memory sub-systems. Existing analytical approaches for modeling the reliability of memory sub-systems with chipkill correct are limited to chipkill-correct solutions that guarantee correction of errors in a single DRAM device. However, stronger chipkill-correct solutions that are capable of guaranteeing the detection and even correction of errors in up to two DRAM devices have become common in existing HPC systems. Analytical reliability models are needed for such memory sub-systems. This paper proposes analytical models for the reliability of double-chipkill detect and/or correct. Validation against Monte Carlo simulations shows that the output of our analytical models is within 3.9% of Monte Carlo simulations, on average. We used the analytical models to study various aspects of the reliability of memory sub-systems protected by double-chipkill detect and/or correct. Our studies provide several insights into the dependence of the reliability of these systems on scale, device fault rate, memory organization, and memory-scrubbing policy.
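    The paper's analytical models are far more detailed; a toy version of the validation idea, under a simple independence assumption, is to compare a binomial expression for "three or more of the n DRAM devices in a rank fault within one interval" (which double-chipkill correct cannot fix) against a Monte Carlo estimate. Device count and fault probability below are placeholders.

```python
import math
import random

def p_uncorrectable_analytical(n_devices: int, p_fault: float) -> float:
    """P(>= 3 faulty devices) under independent per-device fault probability."""
    p_le2 = sum(math.comb(n_devices, k) * p_fault**k * (1 - p_fault)**(n_devices - k)
                for k in range(3))
    return 1.0 - p_le2

def p_uncorrectable_monte_carlo(n_devices: int, p_fault: float,
                                trials: int = 500_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    bad = sum(1 for _ in range(trials)
              if sum(rng.random() < p_fault for _ in range(n_devices)) >= 3)
    return bad / trials

n, p = 18, 1e-2       # e.g. 18 devices per rank, placeholder per-interval fault probability
print(f"analytical  {p_uncorrectable_analytical(n, p):.3e}")
print(f"monte-carlo {p_uncorrectable_monte_carlo(n, p):.3e}")
```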
    As the high performance computing (HPC) community continues to push towards exascale computing, resilience remains a serious challenge. With the expected decrease of both feature size and operating voltage, we expect a significant increase in hardware soft errors. HPC applications of today are only affected by soft errors to a small degree, but we expect that this will become a more serious issue as HPC systems grow. We propose F-SEFI, a Fine-grained Soft Error Fault Injector, as a tool for profiling software robustness against soft errors. In this paper we utilize soft error injection to mimic the impact of errors on logic circuit behavior. Leveraging the open source virtual machine hypervisor QEMU, F-SEFI enables users to modify emulated machine instructions to introduce soft errors. F-SEFI can control which application, which sub-function, and when and how to inject soft errors with different granularities, without interference to other applications that share the same environment. F-SEFI does this without requiring revisions to the application source code, compilers, or operating systems. We discuss the design constraints for F-SEFI and the specifics of our implementation. We demonstrate use cases of F-SEFI on several benchmark applications to show how data corruption can propagate to incorrect results.
    Future exascale application programmers and users need to quantify an application's resilience and vulnerability to soft errors before running their codes on production supercomputers, due to the cost of failures and the hazards posed by silent data corruption. Barring a deep understanding of the resiliency of a particular application, vulnerability evaluation is commonly done through fault injection tools at either the software or hardware level. Hardware fault injection, while most realistic, is limited to customized vendor chips, and applications usually cannot be evaluated at scale. Software fault injection can be done more practically and efficiently and is the approach that many researchers use as a reasonable approximation. With a sufficiently sophisticated software fault injection framework, an application can be studied to see how it would handle many of the errors that manifest at the application level. Using such a tool, a developer can progressively improve resilience at targeted locations they believe are important for their target hardware.
    That cosmic rays cause faults in computers is well established within the reliability community. However, during Solar Proton Events (SPEs), the Earth's magnetic field compresses, which can shield the Earth from the effects of cosmic rays. In this paper, we use statistical analysis to quantitatively assess whether differing flux levels from SPEs lead to significant changes in the number of faults observed on Cielo, a supercomputer at Los Alamos National Laboratory. From our analysis, we found that as flux levels increase during SPEs, there is an overall decrease in the number of faults on Cielo. A better understanding of how SPEs affect fault rates allows the high performance computing reliability community to more accurately compare cosmic-ray-induced faults from different time periods.
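    The paper's statistical analysis is its own; the sketch below merely illustrates one way such a comparison could be framed, assuming daily fault counts tagged as SPE (high proton flux) or quiet days, using a one-sided Mann-Whitney U test to ask whether SPE days tend to have fewer faults. The data here are synthetic.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(7)
quiet_day_faults = rng.poisson(lam=12.0, size=150)   # synthetic baseline daily fault counts
spe_day_faults   = rng.poisson(lam=9.0,  size=20)    # synthetic counts during SPE days

# One-sided test: are fault counts on SPE days stochastically smaller?
stat, p_value = mannwhitneyu(spe_day_faults, quiet_day_faults, alternative="less")
print(f"U = {stat:.1f}, p = {p_value:.4f}")
```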
    Monitoring high performance computing systems has become increasingly difficult as researchers and system analysts face the challenge of synthesizing a wide range of monitoring information in order to detect system problems on ever larger machines. We present a method for anomaly detection on syslog data, one of the most important data streams for determining system health. Syslog messages pose a difficult analysis problem because they include a mix of structured natural language text as well as numeric values. We present an anomaly detection framework that combines graph analysis, relational learning, and kernel density estimation to detect unusual syslog messages. We design an event block detector, which finds groups of related syslog messages, to retrieve the entire section of syslog messages associated with a single anomalous line. Our novel approach successfully retrieves anomalous behaviors inserted into syslog files from a virtual machine, including messages indicating serious system problems. We also test our approach on syslog messages from the Trinity supercomputer and find that our methods do not generate significant false positives.
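    The full pipeline combines graph analysis, relational learning, and kernel density estimation; the fragment below sketches only the density-estimation step, assuming each syslog line has already been reduced to a small numeric feature vector (the features here are hypothetical), and flags lines whose estimated density falls below a low percentile of the normal traffic.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
# Hypothetical per-message features: [message length, rare-token count, numeric-token count]
normal = rng.normal([60, 1, 4], [10, 0.5, 1.5], size=(2000, 3))
weird  = np.array([[220, 9, 30]])          # an injected, unusual message
features = np.vstack([normal, weird])

kde = gaussian_kde(normal.T)               # fit on (presumed) normal traffic
log_dens = np.log(kde(features.T) + 1e-300)
threshold = np.percentile(log_dens[:-1], 0.5)   # bottom 0.5% of normal density

flagged = np.where(log_dens < threshold)[0]
print("flagged message indices:", flagged)      # includes the injected message (index 2000)
```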
    The numerical mode-matching (NMM) method is an efficient algorithm that has previously been used to model various multi-region vertically and cylindrically stratified inhomogeneous media. It has been shown that the NMM method is more efficient than direct use of the finite element method (FEM) to solve these problems. However, the applications of the NMM method have been limited to two-dimensional (2-D)
    The high performance, high efficiency, and low cost of Commercial Off-The-Shelf (COTS) devices make them attractive even for applications with strict reliability constraints. Today, COTS devices are adopted in HPC and in safety-critical applications such as autonomous driving. Unfortunately, we cannot assume that the COTS chip manufacturing process excludes cheap natural boron, which can make these devices highly susceptible to thermal (low-energy) neutrons. Through radiation beam experiments using high-energy and low-energy neutrons, it has been shown that thermal neutrons are a significant threat to COTS device reliability. The evaluation includes an AMD APU, three different NVIDIA GPUs, an Intel accelerator, and an FPGA executing a relevant set of algorithms. Besides the sensitivity of the devices to thermal neutrons, it is also fundamental to consider the thermal neutron flux in different scenarios, such as weather, concrete walls and floors, or even HPC liquid cooling systems. Correlating beam experiments and neutron detector data, it is shown that the thermal neutron FIT rate can be comparable to or even higher than the high-energy neutron FIT rate.
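    The FIT numbers in the paper come from its own beam experiments and detector data; the arithmetic for turning such measurements into a FIT rate is standard and sketched below with placeholder values only: cross section = observed errors / particle fluence, and FIT = cross section x ambient flux x 10^9 hours. The flux values here are assumptions for illustration, not the paper's measurements.

```python
def fit_rate(errors_observed: int, beam_fluence_n_per_cm2: float,
             ambient_flux_n_per_cm2_h: float) -> float:
    """Failures in time (failures per 1e9 device-hours) from beam-test counts."""
    cross_section_cm2 = errors_observed / beam_fluence_n_per_cm2   # errors per (n/cm^2)
    return cross_section_cm2 * ambient_flux_n_per_cm2_h * 1e9

# All numbers below are placeholders, not measurements from the paper.
print(f"high-energy FIT ~ {fit_rate(120, 1e11, 13.0):.1f}")   # assumed sea-level high-energy flux
print(f"thermal FIT     ~ {fit_rate(90, 5e10, 6.0):.1f}")     # thermal flux varies with environment
```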
    As the scale of high performance computing facilities approaches the exascale era, gaining a detailed understanding of hardware failures becomes important. In particular, the extreme memory capacity of modern supercomputers means that data corruption errors which were statistically negligible at smaller scales will become more prevalent. In order to understand hardware faults and mitigate their adverse effects on exascale workloads, we must learn from the behavior of current hardware. In this work, we investigate the predictability of DRAM errors using field data from two recently decommissioned supercomputers: Cielo, at Los Alamos National Laboratory, and Hopper, at Lawrence Berkeley National Laboratory. Due to the volume and complexity of the field data, we apply statistical machine learning to predict the probability of DRAM errors at previously unaccessed locations. We compare the predictive performance of six machine learning algorithms and find that a model incorporating physical knowledge of DRAM spatial structure outperforms purely statistical methods. Our findings both support the expected physical behavior of DRAM hardware and provide a mechanism for real-time error prediction. We demonstrate real-world feasibility by training an error model on one supercomputer and effectively predicting errors on another. Our methods demonstrate the importance of spatial locality over temporal locality in DRAM errors and show that relatively simple statistical models are effective at predicting future errors based on historical data, allowing proactive error mitigation.
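    The study compares six algorithms on real field data; the sketch below, using synthetic stand-in data, only shows the shape of such an experiment with scikit-learn: train a classifier whose inputs include spatial features (rank, bank, neighboring-row error history) on data from one machine and evaluate ROC AUC on data standing in for another machine. Feature names and data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

def synthetic_dram_data(n_rows: int):
    """Stand-in features: [rank, bank, row/1e4, errors_in_neighbor_rows, prior_errors_on_dimm]."""
    X = np.column_stack([
        rng.integers(0, 4, n_rows), rng.integers(0, 16, n_rows),
        rng.random(n_rows), rng.poisson(0.2, n_rows), rng.poisson(0.5, n_rows),
    ]).astype(float)
    # Spatially correlated ground truth: neighbor-row and DIMM error history dominate.
    logits = 1.5 * X[:, 3] + 0.8 * X[:, 4] - 3.0
    y = rng.random(n_rows) < 1.0 / (1.0 + np.exp(-logits))
    return X, y.astype(int)

X_train, y_train = synthetic_dram_data(20_000)   # "train an error model on one supercomputer..."
X_test, y_test = synthetic_dram_data(5_000)      # "...and predict errors on another"

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"cross-machine ROC AUC ~ {auc:.3f}")
```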
    With the ever-growing scaling of computing capability, computing systems like supercomputers and embedded systems are now bounded by limited power. Given the mutually constraining nature of power efficiency and resilience, trade-offs between them have been extensively studied in order to achieve the optimal performance-power ratio, either under a certain power cap or within the quality-metric requirements of applications. Theoretically, running programs in the low-power mode of computational components (e.g., CPU/GPU) can lead to increased on-chip failure rates in terms of register-level susceptibility to soft errors. However, experimentally, such errors may not arise due to register vulnerability: errors that occur during non-vulnerable register access intervals are invalidated and thus will not propagate to later execution. In this work, leveraging register vulnerability, we investigate the validity of failure rates in computing systems at Near-Threshold Voltage (NTV), and empirically e...
    Supercomputers, high performance computers, and clusters are composed of very large numbers of independent operating systems, each generating its own system logs. Messages are generated locally on each host and are usually transferred to a central logging infrastructure, which keeps a master record of the system as a whole. At Los Alamos National Laboratory (LANL), a collection of open source cloud tools is used to log over a hundred million system log messages per day from over a dozen such systems. Understanding what source code created those messages can be extremely useful to system administrators when they are troubleshooting these complex systems, as it can give insight into a subsystem (disk, network, etc.) or even specific lines of source code. Oftentimes, debugging supercomputers is done in environments where open access cannot be provided to all individuals due to security concerns. As such, providing a means for relating system log messages to source code lines allows for communication between system administrators and source developers or supercomputer vendors. In this work, we demonstrate a prototype tool which aims to provide such an expert system. We leverage capabilities from ElasticSearch, one of the open source cloud tools deployed at LANL, and with our own metrics develop a means for correctly matching source code lines, as well as files, with high confidence. We discuss confidence metrics and show that in our experiments 92% of syslog lines were correctly matched. For any future samples, we predict with 95% confidence that the correct file will be detected between 88.2% and 95.8% of the time. Finally, we discuss enhancements that are underway to improve the tool and study it on a larger dataset.
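    The quoted 95% confidence band (88.2% to 95.8% around the 92% match rate) has the shape of a standard binomial interval; the sketch below shows the normal-approximation calculation with an assumed sample size of 200 (the abstract does not state the actual number of evaluated syslog lines), which happens to give a band close to the quoted one, so the numbers here are illustrative only.

```python
import math

def binomial_ci(p_hat: float, n: int, z: float = 1.96):
    """95% normal-approximation confidence interval for a proportion."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Assumed sample size, chosen only to show the shape of the calculation.
low, high = binomial_ci(0.92, n=200)
print(f"95% CI: {low:.1%} to {high:.1%}")
```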
    This paper presents an algorithm-based fault tolerance method to harden three two-sided matrix factorizations against soft errors: reduction to Hessenberg form, tridiagonal form, and bidiagonal form. These two-sided factorizations are usually the prerequisites to computing eigenvalues/eigenvectors and the singular value decomposition. Algorithm-based fault tolerance has been shown to work on the three main one-sided matrix factorizations: LU, Cholesky, and QR, but extending it to cover two-sided factorizations is non-trivial because there is no obvious offline, problem-specific maintenance of checksums. We thus develop an online, algorithm-specific checksum scheme and show how to systematically adapt the two-sided factorization algorithms used in the LAPACK and ScaLAPACK packages to introduce the algorithm-based fault tolerance. The resulting ABFT scheme can detect and correct arithmetic errors continuously during the factorizations, allowing timely error handling. ...
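    The paper's contribution is the online checksum scheme for two-sided factorizations; the sketch below only demonstrates the classical Huang-Abraham-style checksum idea it builds on, in the simpler setting of a matrix product: encode row and column checksums, inject a single-element error, then locate and correct it from the checksum residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))

# Column-checksum A and row-checksum B: their product carries both checksums.
A_c = np.vstack([A, A.sum(axis=0)])                   # (n+1) x n
B_r = np.hstack([B, B.sum(axis=1, keepdims=True)])    # n x (n+1)
C_f = A_c @ B_r                                       # (n+1) x (n+1) checksummed result

# Inject a single soft error into the data part of the result.
i_err, j_err, delta = 2, 4, 7.5
C_f[i_err, j_err] += delta

# Detection/location: the row and column whose checksums no longer match.
row_resid = C_f[:n, :n].sum(axis=1) - C_f[:n, n]
col_resid = C_f[:n, :n].sum(axis=0) - C_f[n, :n]
i_det, j_det = int(np.argmax(np.abs(row_resid))), int(np.argmax(np.abs(col_resid)))

# Correction: subtract the residual at the located element.
C_f[i_det, j_det] -= row_resid[i_det]
print((i_det, j_det) == (i_err, j_err), np.allclose(C_f[:n, :n], A @ B))   # True True
```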
