
Memory Errors in Modern Systems

2015, Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems

Memory Errors in Modern Systems: The Good, The Bad, and The Ugly

Vilas Sridharan (1), Nathan DeBardeleben (2), Sean Blanchard (2), Kurt B. Ferreira (3), Jon Stearley (3), John Shalf (4), Sudhanva Gurumurthi (5)

(1) RAS Architecture, (5) AMD Research, Advanced Micro Devices, Inc., Boxborough, MA
(2) Ultrascale Systems Research Center, Los Alamos National Laboratory, Los Alamos, New Mexico (*)
(3) Scalable System Software, Sandia National Laboratories, Albuquerque, New Mexico (†)
(4) National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory, Berkeley, CA (‡)

{vilas.sridharan, sudhanva.gurumurthi}@amd.com, {ndebard, seanb}@lanl.gov, {kbferre, jrstear}@sandia.gov, jshalf@lbl.gov

(*) A portion of this work was performed at the Ultrascale Systems Research Center (USRC) at Los Alamos National Laboratory, supported by the U.S. Department of Energy contract DE-FC02-06ER25750. The publication has been assigned the LANL identifier LA-UR-14-26219.
(†) Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under Contract DE-AC04-94AL85000. The publication has been assigned the Sandia identifier SAND2014-16515J.
(‡) A portion of this work used resources of the National Energy Research Scientific Computing Center supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

ASPLOS '15, March 14–18, 2015, Istanbul, Turkey. Copyright 2015 ACM 978-1-4503-2835-7/15/03. http://dx.doi.org/10.1145/2694344.2694348

Abstract

Several recent publications have shown that hardware faults in the memory subsystem are commonplace. These faults are predicted to become more frequent in future systems that contain orders of magnitude more DRAM and SRAM than found in current memory subsystems. These memory subsystems will need to provide resilience techniques to tolerate these faults when deployed in high-performance computing systems and data centers containing tens of thousands of nodes. Therefore, it is critical to understand the efficacy of current hardware resilience techniques to determine whether they will be suitable for future systems.

In this paper, we present a study of DRAM and SRAM faults and errors from the field. We use data from two leadership-class high-performance computer systems to analyze the reliability impact of hardware resilience schemes that are deployed in current systems. Our study has several key findings about the efficacy of many currently-deployed reliability techniques such as DRAM ECC, DDR address/command parity, and SRAM ECC and parity. We also perform a methodological study, and find that counting errors instead of faults, a common practice among researchers and data center operators, can lead to incorrect conclusions about system reliability. Finally, we use our data to project the needs of future large-scale systems.
We find that SRAM faults are unlikely to pose a significantly larger reliability threat in the future, while DRAM faults will be a major concern and stronger DRAM resilience schemes will be needed to maintain acceptable failure rates similar to those found on today's systems.

Categories and Subject Descriptors: B.8.1 [Performance and Reliability]: Reliability, Testing, and Fault-Tolerance

Keywords: Field studies; Large-scale systems; Reliability

1. Introduction

Current predictions are that exascale systems in the early 2020s will have between 32 and 100 petabytes of main memory (DRAM), a 100x to 350x increase compared to 2012 levels [8]. Similar increases are likely in the amount of cache memory (SRAM) in such systems. Future data centers will also contain many more nodes than existing data centers. These systems and data centers will require significant increases in the reliability of both DRAM and SRAM memories in order to maintain hardware failure rates comparable to current systems.

The focus of this work is to provide insight and guidance to system designers and operators on characteristics of a reliable system. In particular, our focus is on analyzing the efficacy of existing hardware resilience techniques, their impact on reliable system design, and whether they will be adequate for future systems.

For our analysis, we use data collected over the past several years from two leadership-class production systems: Hopper, a 6,000-node supercomputer located at the NERSC center at Lawrence Berkeley Labs in Oakland, California; and Cielo, an 8,500-node supercomputer located at Los Alamos National Laboratory (LANL) in Los Alamos, New Mexico. Both supercomputers are based on AMD Opteron™ CPUs and contain DDR-3 DRAM. In aggregate, the data that we analyze (which is a subset of the full data available) comprises over 314 million CPU socket-hours and 45 billion DRAM device-hours. This scale gives us the ability to gather statistically representative data on faults observed during production lifetimes of CPU and DRAM devices. While our data is collected primarily from supercomputing centers, the results of our analysis are applicable to any large data center or compute cluster where hardware reliability is important.

This paper adds several novel insights about system reliability to the existing literature and also highlights certain aspects of analyzing field data that are critical to perform correctly in order to derive correct insights. Broad contributions include:

• A detailed analysis of DRAM and SRAM faults on Hopper. This data complements existing studies on faults in other production systems, including Jaguar, Cielo, and Blue Waters, thus furthering our understanding of DRAM and SRAM faults [12][33][34].

• The effect of altitude on DRAM fault rates. To our knowledge, this is the first study to present an analysis of altitude effects on DRAM in production systems.

• The impact of counting errors instead of faults to evaluate system reliability. Counting errors is common among researchers and data center operators (e.g., [12][30]), but no one has quantified the effect of this methodology on the accuracy of the reliability conclusions obtained.

• The effect of several hardware-based resilience techniques on system reliability, including SEC-DED ECC, chipkill ECC, and command/address parity in DRAM subsystems. We also examine SRAM resilience schemes, and analyze in depth the impact of design choices on SRAM reliability.
• A projection of the impacts of SRAM and DRAM faults on future exascale-class systems.

Our study has several key findings that are of interest to system designers and operators. These findings are wide ranging and cover a variety of hardware resilience techniques in common use in the industry. We group these findings into three categories: the good, the bad, and the ugly.

The Good. Our study highlights several advances made in the understanding of fault behavior and system reliability. These include:

• DDR command and address parity, an addition to JEDEC's DDR-3 and DDR-4 specifications, has a substantial positive effect on system reliability, detecting errors at a rate comparable to the uncorrected error rate of chipkill ECC.

• The majority of observed SRAM faults in the field are transient faults from particle strikes. This further confirms that SRAM faults are a well-understood phenomenon in current and near-future process technologies.

• The majority of uncorrected SRAM errors are due to single-bit faults; therefore, reducing SRAM uncorrected error rates is a matter of engineering time and effort (e.g., replacing parity codes with ECCs) rather than new research and novel techniques. Furthermore, appropriate multi-bit fault protection reduces or eliminates the expected increase in uncorrected error rate from SRAM faults due to high altitude.

The Bad. Unfortunately, some of our findings point to areas where more work needs to be done, or more understanding is required. These include:

• Altitude increases the fault rate of some DRAM devices, though this effect varies by vendor. This is evidence that some (but not all) DRAM devices are susceptible to transient faults from high-energy particle strikes.

• Unlike on-chip SRAM, external memory (e.g., DRAM) of future systems will require stronger resilience techniques than exist in supercomputing systems today.

The Ugly. Finally, some of our results show potential problems in commonly-used practices or techniques:

• Performing field studies is difficult. For instance, we demonstrate that counting errors instead of faults can lead to incorrect conclusions about system reliability. We examine results from recent studies that have used error counts, and use our data to show that using this methodology can result in misleading or inaccurate conclusions.

• SEC-DED ECC, a commonly used ECC technique, is poorly suited to modern DRAM subsystems and may result in undetected errors (which may cause silent data corruption) at a rate of up to 20 FIT per DRAM device, an unacceptably high rate for many enterprise data centers and high-performance computing systems.

The rest of this paper is organized as follows. Section 2 defines the terminology we use in this paper. Section 3 discusses related studies and describes the differences in our study and methodology. Section 4 explains the system and DRAM configurations of Cielo and Hopper. Section 5 describes our experimental setup. Section 6 presents baseline data on faults. Section 7 examines the impact of counting errors instead of faults. Section 8 presents our analysis of existing hardware resilience techniques. Section 9 extracts lessons from our data for system designers. Section 10 presents our projections for future system reliability, and Section 11 concludes.
2. Terminology

In this paper, we distinguish between a fault and an error as follows [6]:

• A fault is the underlying cause of an error, such as a stuck-at bit or high-energy particle strike. Faults can be active (causing errors) or dormant (not causing errors).

• An error is an incorrect portion of state resulting from an active fault, such as an incorrect value in memory. Errors may be detected and possibly corrected by higher-level mechanisms such as parity or error correcting codes (ECC). They may also go uncorrected, or in the worst case, completely undetected (i.e., silent).

Hardware faults can further be classified as transient, intermittent, or hard [7][10][11]. Distinguishing a hard fault from an intermittent fault in a running system requires knowing the exact memory access pattern to determine whether a memory location returns the wrong data on every access. In practice, this is impossible in a large-scale field study such as ours. Therefore, we group intermittent and hard faults together in a category of permanent faults.

In this paper, we examine a variety of error detection and correction mechanisms. Some of these mechanisms include: parity, which can detect but not correct any single-bit error; single-error-correction double-error-detection error correcting codes (SEC-DED ECCs), which can correct any single-bit error and detect any double-bit error; and chipkill ECCs. We discuss two levels of chipkill ECC: chipkill-detect ECC, which can detect but not correct any error in a single DRAM chip; and chipkill-correct, which can correct any error in a single DRAM chip. To protect against multi-bit faults, structures with ECC or parity sometimes employ bit interleaving, which ensures that physically adjacent bitcells are protected by different ECC or parity words.
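To make the interleaving idea concrete, the following minimal sketch (our illustration, not taken from the paper; the interleave factor and word size are assumptions) maps physically adjacent bitcells onto different logical ECC words, so that an upset of several adjacent cells appears to each code word as at most a single-bit error.

```python
# Illustrative sketch of bit interleaving (our example, not from the paper).
# With an interleave factor of 4, four neighboring bitcells belong to four
# different ECC words, so a 3-bit adjacent upset corrupts at most one bit
# in any single word -- which SEC-DED can correct.

INTERLEAVE = 4    # assumed: adjacent cells spread across 4 ECC words
WORD_BITS = 72    # assumed: 64 data bits + 8 check bits per ECC word

def ecc_word_of(physical_bit: int) -> int:
    """Map a physical bitcell index to the logical ECC word that protects it."""
    group = physical_bit // (INTERLEAVE * WORD_BITS)   # which group of 4 words
    return group * INTERLEAVE + physical_bit % INTERLEAVE

def errors_per_word(upset_bits):
    """Count how many corrupted bits each ECC word sees for a given upset."""
    counts = {}
    for bit in upset_bits:
        word = ecc_word_of(bit)
        counts[word] = counts.get(word, 0) + 1
    return counts

# A 3-bit upset of physically adjacent cells: each affected word sees 1 bit.
print(errors_per_word([100, 101, 102]))   # -> {0: 1, 1: 1, 2: 1}
# Without interleaving, all 3 bits would land in one word and defeat SEC-DED.
```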
3. Related Work

During the past several years, multiple studies have been published examining failures in production systems. In 2006, Schroeder and Gibson studied failures in supercomputer systems at LANL [29]. In 2007, Li et al. published a study of memory errors on three different data sets, including a server farm of an Internet service provider [21]. In 2009, Schroeder et al. published a large-scale field study using Google's server fleet [30]. In 2010, Li et al. published an expanded study of memory errors at an Internet server farm and other sources [20]. In 2012, Hwang et al. published an expanded study on Google's server fleet, as well as two IBM Blue Gene clusters [16], Sridharan and Liberty presented a study of DRAM failures in a high-performance computing system [33], and El-Sayed et al. published a study on temperature effects of DRAM in data center environments [14]. In 2013, Siddiqua et al. presented a study of DRAM failures from client and server systems [31], and Sridharan et al. presented a study of DRAM and SRAM faults, with a focus on positional and vendor effects [34]. Finally, in 2014, Di Martino et al. presented a study of failures in Blue Waters, a high-performance computing system at the University of Illinois, Urbana-Champaign [12].

Our study covers a larger scale and longer time intervals than many of the prior studies, allowing us to present more representative data on faults and errors. Among those studies with similar or larger scale to ours, many count errors instead of faults [12][14][16][30], which we show in Section 7 to be inaccurate. In addition, few of these prior studies examine the efficacy of hardware resilience techniques such as SEC-DED, chipkill, and DDR command/address parity.

There has been laboratory testing on DRAM and SRAM dating back several decades (e.g., [9][13][23][24][27]), as well as a recent study by Kim et al. on DRAM disturbance faults [19]. Lab studies such as these allow an understanding of fault modes and root causes, and complement field studies that identify fault modes which occur in practice.

4. Systems Configuration

Our study comprises data from two production systems in the United States: Hopper, a supercomputer located in Oakland, California, at 43 feet in elevation; and Cielo, a supercomputer in Los Alamos, New Mexico, at around 7,300 feet in elevation. A summary of relevant statistics on both systems is given in Table 1.

Table 1. System configuration information.

Parameter      | Cielo          | Hopper
Nodes          | 8,568          | 6,000
Sockets / Node | 2              | 2
Cores / Socket | 8              | 12
DIMMs / Socket | 4              | 4
Location       | Los Alamos, NM | Oakland, CA
Altitude       | 7,320 ft.      | 43 ft.

Hopper contains approximately 6,000 compute nodes. Each node contains two 12-core AMD Opteron™ processors, each with twelve 32KB L1 data caches, twelve 512KB L2 caches, and one 12MB L3 cache. Each node has eight 4GB DDR-3 registered DIMMs for a total of 32GB of DRAM.

Cielo contains approximately 8,500 compute nodes. Each node contains two 8-core AMD Opteron™ processors, each with eight 32KB L1 data caches, eight 512KB L2 caches, and one 12MB L3 cache. Each node has eight 4GB DDR-3 registered DIMMs for a total of 32GB of DRAM.

The nodes in both machines are organized as follows. Four nodes are connected to a slot, which is a management module. Eight slots are contained in a chassis. Three chassis are mounted bottom-to-top (numerically) in a rack. Cielo has 96 racks, arranged into 6 rows each containing 16 racks.

4.1 DRAM and DIMM Configuration

In both Hopper and Cielo, each DDR-3 DIMM contains two ranks of 18 DRAM devices, each with four data (DQ) signals (known as an x4 DRAM device). In each rank, 16 of the DRAM devices are used to store data bits and two are used to store check bits. A lane is a group of DRAM devices on different ranks that shares data (DQ) signals. DRAMs in the same lane also share a strobe (DQS) signal, which is used as a source-synchronous clock signal for the data signals. A memory channel has 18 lanes, each with two ranks (i.e., one DIMM per channel). Each DRAM device contains eight internal banks that can be accessed in parallel. Logically, each bank is organized into rows and columns. Each row/column address pair identifies a 4-bit word in the DRAM device. Figure 1 shows a diagram of a single memory channel.

[Figure 1. A simplified logical view of a single channel of the DRAM memory subsystem on each Cielo and Hopper node.]

Physically, all DIMMs on Cielo and Hopper (from all vendors) are identical. Each DIMM is double-sided. DRAM devices are laid out in two rows of nine devices per side. There are no heatsinks on DIMMs in Cielo or Hopper. The primary difference between the two memory subsystems is that Cielo uses chipkill-correct ECC on its memory subsystem, while Hopper uses chipkill-detect ECC.

5. Experimental Setup

For our analysis we use three different data sets: corrected error messages from console logs, uncorrected error messages from event logs, and hardware inventory logs. Together, these three logs provide the ability to map each error message to the specific hardware present in the system at that point in time.
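As an illustration of that mapping step, the sketch below (our own; the record layouts and field names are hypothetical, since the paper does not describe its tooling) joins each error record against the inventory snapshot in effect at the error's timestamp to recover which physical DIMM was installed at the time.

```python
import bisect
from dataclasses import dataclass

# Hypothetical record layouts; the real console and inventory logs differ.
@dataclass
class ErrorRecord:
    timestamp: float   # seconds since epoch
    node: str          # node name from the console log
    dimm_slot: str     # DIMM slot decoded from the logged physical address

@dataclass
class InventorySnapshot:
    timestamp: float   # when this hardware inventory was taken
    dimms: dict        # (node, slot) -> (manufacturer, part_number)

def attribute_error(err, snapshots):
    """Return (manufacturer, part_number) of the DIMM installed when the error occurred.

    `snapshots` must be sorted by timestamp; we pick the latest snapshot taken
    at or before the error, so DIMM swaps between snapshots are respected.
    """
    times = [s.timestamp for s in snapshots]
    idx = bisect.bisect_right(times, err.timestamp) - 1
    if idx < 0:
        return None   # error predates the first inventory snapshot we have
    return snapshots[idx].dimms.get((err.node, err.dimm_slot))
```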
Corrected error logs contain events from nodes at specific time stamps. Each node in the system has a hardware memory controller that logs corrected error events in registers provided by the x86 machine-check architecture (MCA) [5]. Each node's operating system is configured to poll the MCA registers once every few seconds and record any events it finds to the node's console log. Uncorrected error event logs are similar and contain data on uncorrected errors logged in the system. These are typically not logged via polling, but instead logged after the node reboots as a result of the uncorrected error.

Both console and event logs contain a variety of other information, including the physical address and ECC syndrome associated with each error. These events are decoded further using configuration information to determine the physical DRAM location associated with each error. For this analysis, we decoded the location to show the DIMM, as well as the DRAM bank, column, row, and chip. We make the assumption that each DRAM device experiences a single fault during our observation interval. The occurrence time of each DRAM fault corresponds to the time of the first observed error message per DRAM device. We then assign a specific type and mode to each fault based on the associated errors in the console log. Our observed fault rates indicate that fewer than two DRAM devices will suffer multiple faults within our observation window, and thus the error in this method is low. Prior studies have also used and validated a similar methodology (e.g., [34]).

Hardware inventory logs are separate logs and provide snapshots of the hardware present in each machine at different points in its lifetime. These files provide unique insight into the hardware configuration that is often lacking on other systems. In total we analyzed over 400 individual hardware inventory logs that covered a span of approximately three years on Hopper and two years on Cielo. Each of these files consists of between 800 thousand and 1.5 million lines of explicit description of each host's hardware, including configuration information and information about each DIMM such as the manufacturer and part number.

We limit our analysis to the subset of dates for which we have both hardware inventory logs and error logs. For Hopper, this includes 18 months of data from April 2011 to January 2013. For Cielo, this includes 12 months of data from July 2011 to November 2012. For both systems, we exclude the first several months of data to avoid attributing hard faults that occurred prior to the start of our dataset to the first months of our observation. Also, in the case of both systems, time periods where the system was not in a consistent state or was in a transition state were excluded.

For confidentiality purposes, we anonymize all DIMM vendor information. Because this is not possible for the CPU vendor, we present all CPU data in arbitrary units (i.e., all data is normalized to one of the data points in the graph). While this obscures absolute rates, it still allows us to observe trends and compare relative rates. Since every data center and computing system is different, this is often the most relevant information to extract from a study such as this.

5.1 Methodology

Both Cielo and Hopper include hardware scrubbers in DRAM and in the L1, L2, and L3 caches. Therefore, we can identify permanent faults as those faults that survive a scrub operation. Thus, we classify a fault as permanent when a device generates errors in multiple scrub intervals (i.e., when the errors are separated by a time period that contains at least one scrub operation), and transient when it generates errors in only a single scrub interval, as illustrated by the sketch below.
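A minimal sketch of this classification rule (ours; the authors do not publish their analysis scripts), assuming we already have the error timestamps for one device and the scrub interval of the structure it belongs to:

```python
def classify_fault(error_times_sec, scrub_interval_hours):
    """Classify one device's fault as 'transient' or 'permanent'.

    If the device's errors span more than one scrub interval (so at least one
    scrub ran between the first and last error, yet errors kept appearing),
    the fault survived scrubbing and is treated as permanent; otherwise it is
    treated as transient.
    """
    if not error_times_sec:
        return None
    span_hours = (max(error_times_sec) - min(error_times_sec)) / 3600.0
    return "permanent" if span_hours > scrub_interval_hours else "transient"

# With Cielo's 24-hour DRAM scrub interval: errors three days apart must have
# survived at least one scrub (permanent); errors two hours apart need not
# have (transient).
print(classify_fault([0.0, 3 * 24 * 3600.0], scrub_interval_hours=24))  # permanent
print(classify_fault([0.0, 2 * 3600.0], scrub_interval_hours=24))       # transient
```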
For example, Cielo's DRAM scrub interval is 24 hours. A fault that generates errors separated by more than 24 hours is classified as permanent, while a fault that generates errors only within a 24-hour period is classified as transient.

Our observed fault rates indicate that fewer than two DRAM devices will suffer multiple faults within our observation window. Therefore, similar to previous field studies, we make the simplifying assumption that each DRAM device experiences a single fault during our observation interval [33]. The occurrence time of each DRAM fault corresponds to the time of the first observed error message from that DRAM device. We then assign a specific type and mode to each fault based on the subsequent errors from that device in the console logs. We use a similar methodology (based on fault rates) for SRAM faults.

6. Baseline Data on Hopper Faults

In this section, we present our baseline data on DRAM and SRAM fault modes and rates in Hopper. We have published similar information on Cielo in prior work [34].

6.1 DRAM Faults

Figure 2 shows the number of DRAM device-hours per vendor during our measurement interval on Hopper. Our data consists of approximately 1 billion DRAM device-hours of operation or more for each vendor, enough to draw statistically meaningful conclusions.

[Figure 2. DRAM device-hours per vendor on Hopper. Even for the least-populous vendor (A), we have almost 1B device-hours of data.]

Figure 3 shows the DRAM fault rate over time in Hopper. Similar to other systems, Hopper experienced a declining rate of permanent DRAM faults and a constant rate of transient faults during our measurement interval [33][34].

[Figure 3. Hopper DRAM device fault rates per month (30-day period); 22.3 billion DRAM device-hours total. The rate of permanent faults decreases over time, while the rate of transient faults remains approximately constant.]

Table 2 shows a breakdown of the DRAM fault modes experienced in Hopper. Similar to prior studies, we identify several unique DRAM fault modes: single-bit, in which all errors map to a single bit; single-word, in which all errors map to a single word; single-column, in which all errors map to a single column; single-row, in which all errors map to a single row; single-bank, in which all errors map to a single bank; multiple-bank, in which errors map to multiple banks; and multiple-rank, in which errors map to multiple DRAMs in the same lane.

Table 2. DRAM fault modes in Hopper.

Fault Mode    | Total Faults | Transient | Permanent
Single-bit    | 78.9%        | 42.1%     | 36.8%
Single-word   | 0.0%         | 0.0%      | 0.0%
Single-column | 5.9%         | 0.0%      | 5.9%
Single-row    | 9.2%         | 1.8%      | 7.4%
Single-bank   | 4.3%         | 0.4%      | 3.9%
Multiple-bank | 0.6%         | 0.0%      | 0.6%
Multiple-rank | 1.0%         | 0.2%      | 0.8%

Similar to other DDR-2 and DDR-3 systems, a majority of DRAM faults are single-bit faults, but a non-trivial minority of faults are large multi-bit faults, including row, column, bank, and chip faults.
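The rates above are quoted in FIT (failures in time), i.e., faults per billion device-hours. A small sketch of the conversion, using a round, hypothetical fault count of our own rather than the paper's exact tally:

```python
def fit_rate(fault_count: int, device_hours: float) -> float:
    """FIT = failures in time = faults per 10^9 device-hours."""
    return fault_count / device_hours * 1e9

# Hypothetical illustration: roughly 560 DRAM faults observed over the
# 22.3 billion device-hours of Figure 3 would correspond to about 25 FIT
# per DRAM device, the order of magnitude reported for Hopper.
print(round(fit_rate(560, 22.3e9), 1))   # -> 25.1
```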
Figure 4 shows the transient and permanent fault rate in Hopper in FIT per DRAM device, broken down by DRAM vendor. Vendors A, B, and C in Hopper are the same as vendors A, B, and C in Cielo [34]. The figure shows substantial variation in per-vendor FIT rates, consistent with data from Cielo [34]. This implies that the choice of DRAM vendor is an important consideration for system reliability.

[Figure 4. Fault rate per DRAM vendor. Hopper sees substantial variation in fault rates by DRAM vendor.]

6.1.1 Effect of Altitude

It is well-known that altitude has an effect on modern SRAM devices due to high-energy neutrons from cosmic radiation [7], and particle strikes in DRAM were a problem in the past [23]. It is uncertain, however, whether high-energy neutrons are a significant source of faults in modern DRAM devices. We have previously published data on DRAM faults in Cielo [34]. We use this data in conjunction with Hopper data to examine the effect of altitude on DRAM faults by comparing fault rates of similar DRAM devices in Cielo and Hopper.

The Hopper and Cielo memory subsystems are extremely similar: the systems use memory from the same three DRAM vendors, share the same DDR-3 memory technology, and use the same memory controllers. A major difference between the two systems is their altitude: the neutron flux, relative to New York City, experienced by Cielo is 5.53, and the relative neutron flux experienced by Hopper is 0.91 [1]. Therefore, by comparing fault rates in Cielo and Hopper, we can determine what effect, if any, altitude has on DRAM fault rates.

Figure 5 plots the rate of transient faults on Cielo relative to the rate of transient faults on Hopper, using Cielo data from [34]. The figure shows that vendor A experiences a substantially higher transient fault rate on Cielo, while vendor B experiences a modestly higher transient fault rate and vendor C experiences virtually the same transient fault rate.

[Figure 5. Cielo DRAM transient fault rates relative to Hopper (Hopper == 1.0). Vendor A shows a significantly higher transient fault rate in Cielo, likely attributable to altitude.]

Table 3 breaks down this data and shows the single-bit, single-column, and single-bank transient fault rates for each system by DRAM vendor. The table shows that the single-bit transient fault rates for vendors A and B are substantially higher in Cielo than in Hopper, as are the single-column and single-bank transient fault rates for vendor A. The rates of all other transient fault modes were within 1 FIT of each other on both systems. Due to the similarities between the two systems, the most likely cause of the higher single-bit, single-column, and single-bank transient fault rates in Cielo is particle strikes from high-energy neutrons.

Table 3. Rate of single-bit, single-column, and single-bank transient DRAM faults in Cielo and Hopper (in FIT). The higher rate of transient faults in Cielo may indicate that these faults are caused by high-energy particles.

Fault Mode    | Vendor | Cielo | Hopper
Single-bit    | A      | 31.85 | 18.75
Single-bit    | B      | 15.61 | 9.70
Single-bit    | C      | 6.10  | 7.93
Single-column | A      | 6.05  | 0.0
Single-column | B      | 0.55  | 0.0
Single-column | C      | 0.18  | 0.0
Single-bank   | A      | 14.65 | 0.0
Single-bank   | B      | 0.07  | 0.09
Single-bank   | C      | 0.18  | 0.10

Our main observation from this data is that particle-induced transient faults remain a noticeable effect in DRAM devices, although the susceptibility varies by vendor and there are clearly other causes of faults (both transient and permanent) in modern DRAMs. Also, our results show that while altitude does have an impact on system reliability, judicious choice of the memory vendor can reduce this impact.
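As a back-of-envelope illustration of this point (our own decomposition, not an analysis the paper performs), the two sites' known relative neutron fluxes can be used to split a vendor's transient FIT rate into an altitude-independent part and a part that scales with flux:

```python
# Back-of-envelope decomposition (ours, not from the paper): model transient
# FIT as an altitude-independent component plus a component proportional to
# the local neutron flux relative to New York City, and solve using the two
# sites: Cielo (5.53x NYC flux) and Hopper (0.91x NYC flux).

FLUX_CIELO, FLUX_HOPPER = 5.53, 0.91

def decompose(fit_cielo: float, fit_hopper: float):
    """Return (altitude_independent_fit, fit_per_unit_relative_flux)."""
    per_flux = (fit_cielo - fit_hopper) / (FLUX_CIELO - FLUX_HOPPER)
    base = fit_hopper - FLUX_HOPPER * per_flux
    return base, per_flux

# Vendor A single-bit transient faults (Table 3): 31.85 FIT at Cielo and
# 18.75 FIT at Hopper suggest roughly 16 FIT that does not scale with
# altitude plus ~2.8 FIT per unit of relative neutron flux.
print(decompose(31.85, 18.75))
```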
6.1.2 Multi-bit DRAM Faults

Figure 6 shows an example of a single-bank fault from Hopper in which multiple adjacent DRAM rows have bit flips. The figure shows that errors from this fault were limited to several columns in three logically adjacent rows. Multiple faults within our dataset matched this general pattern, although the occurrence rate was relatively low. All errors from this fault were corrected by ECC.

[Figure 6. A single-bank fault from one Hopper node.]

The pattern of errors from this fault appears to be similar to a fault mode called DRAM disturbance faults or "rowhammer" faults [19][26]. This fault mode results in corruption of a "victim row" when an "aggressor row" is opened repeatedly in a relatively short time window. We did not do root-cause analysis on this part; therefore, we cannot say for certain that the fault in question is a disturbance fault. However, it is clear that fault modes with similar characteristics do occur in practice and must be accounted for.

Our primary observation from the data in this section is that new and unexpected fault modes may occur after a system is architected, designed, and even deployed: row-hammer faults were only identified by industry after Hopper was designed, and potentially after it began service [26]. Prior work has looked at tailoring detection to specific fault modes (e.g., [35]), but the data show that we need to be able to handle a larger variety of fault modes and that robust detection is crucial.

6.2 SRAM Faults

In this section, we examine fault rates of SRAM in the AMD Opteron™ processors in Hopper.

6.2.1 SRAM Fault Rate

Figure 7 shows the rate of SRAM faults per socket in Hopper, broken down by hardware structure.

[Figure 7. Rate of SRAM faults per socket in Hopper, relative to the fault rate in the L2 TLB. The fault rate is affected by structure size and organization, as well as workload effects.]

The figure makes clear that SRAM fault rates are dominated by faults in the L2 and L3 data caches, the two largest structures in the processor. However, it is clear that faults occur in many structures, including smaller structures such as tag and TLB arrays. A key finding from this figure is that, at scale, faults occur even in small on-chip structures, and detection and correction in these structures is important to ensure correctness for high-performance computing workloads running on large-scale systems. This result is significant for system designers considering different devices on which to base their systems. Devices should either offer robust coverage on all SRAM and significant portions of sequential logic, or provide alternate means of protection such as redundant execution [36].
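Because these structures differ in size by orders of magnitude, the comparison in the next subsection is done on a per-bit basis. A small sketch of that normalization (ours; the structure sizes follow the configuration in Section 4, but the fault counts and socket-hours are made up for illustration):

```python
# Per-bit normalization (our sketch; fault counts and socket-hours below are
# hypothetical). Raw per-socket fault counts are dominated by the largest
# arrays, so structures are compared on per-bit FIT instead.

BITS_PER_KB = 1024 * 8

structures = {
    # structure: (bits per socket, hypothetical observed fault count)
    "L1D data (12 x 32KB)": (12 * 32 * BITS_PER_KB,   40),
    "L2 data (12 x 512KB)": (12 * 512 * BITS_PER_KB, 600),
}

SOCKET_HOURS = 1.5e8   # hypothetical observation interval across all sockets

for name, (bits, faults) in structures.items():
    per_bit_fit = faults / (bits * SOCKET_HOURS) * 1e9   # faults per 1e9 bit-hours
    print(f"{name}: {per_bit_fit:.2e} FIT/bit")
```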
6.2.2 Comparison to Accelerated Testing

Accelerated particle testing is used routinely to determine the sensitivity of SRAM devices to single-event upsets from energetic particles. To be comprehensive, the accelerated testing must include all particles to which the SRAM cells will be exposed in the field (e.g., neutrons, alpha particles), with energy spectra that approximate real-world conditions. Therefore, it is important to correlate accelerated testing data with field data to ensure that the conditions are in fact similar.

Accelerated testing on the SRAM used in Hopper was performed using a variety of particle sources, including the high-energy neutron beam at the LANSCE ICE House [22] at LANL and an alpha particle source. The testing was "static" testing, rather than operational testing. Static testing initializes SRAM cells with known values, exposes them to the beam, and then compares the results to the initial state to look for errors. The L2 and L3 caches employ hardware scrubbers, so we expect the field error logs to capture the majority of bit flips that occur in these structures.

Figure 8 compares the per-bit rate of SRAM faults in Hopper's L2 and L3 data and tag arrays to results obtained from the accelerated testing campaign on the SRAM cells.

[Figure 8. Rate of SRAM faults in Hopper compared to accelerated testing.]

The figure shows that accelerated testing predicts a lower fault rate than seen in the field in all structures except the L2 tag array; the fault rate in the L2 tag is approximately equal to the rate from accelerated testing. Overall, the figure shows good correlation between rates measured in the field and rates measured from static SRAM testing. Our conclusion from this correlation is that the majority of SRAM faults in the field are caused by known particles. While an expected result, confirming expectations with field data is important to ensure that parts are functioning as specified, to identify potential new or unexpected fault modes that may not have been tested in pre-production silicon, and to ensure that accelerated testing reflects reality.

7. Fallacies in Measuring System Health

All data and analyses presented in the previous section refer to fault rates, not error rates. Some previous studies have reported system error rates instead of fault rates [12][14][16][30]. Error rates are heavily dependent on a system's software configuration and its workloads' access patterns in addition to the health of the system, which makes error rates an imperfect measure of hardware reliability. In this section, we show that measuring error rates can lead to erroneous conclusions about hardware reliability.

[Figure 9. Hopper's memory error rate relative to Cielo. Hopper has a memory error rate 4x that of Cielo, but a memory fault rate 0.625x that of Cielo. Because error counts are confounded by other factors such as workload behavior, they are not an accurate measure of system health.]

7.1 Error Logging Architecture

In most x86 CPUs, DRAM errors are logged in a register bank in the northbridge block [4][31]. Each register bank can log one error at a time; the x86 architecture dictates that the hardware discard subsequent corrected errors until the bank is read and cleared by the operating system [5]. The operating system typically reads the register bank via a polling routine executed once every few seconds [2]. Therefore, on a processor that issues multiple memory accesses per nanosecond, millions of errors may be discarded between consecutive reads of the register bank.
The error logging architecture described above means that the console logs represent only a sample of all corrected errors that have occurred on a system. Moreover, the number of logged corrected errors is highly dependent on the polling frequency set by the operating system. For instance, if the operating system uses a 5-second polling frequency, only one error per node can be reported per 5-second interval. Uncorrected errors in x86 processors, on the other hand, are collected through a separate machine-check exception mechanism [5]. These exceptions are individually delivered to the operating system, and thus uncorrected error counts do not suffer this sampling problem.

7.2 Counting Errors vs. Counting Faults

Using our data, we provide two concrete examples of how counting errors can lead to incorrect conclusions about system reliability. Our first example is a study by Di Martino et al. on the Blue Waters system [12]. The second example is a study by Schroeder et al. on memory errors in Google data centers [30].

Example 1. First, we examine the Di Martino et al. claim that the chipkill ECCs used on Blue Waters' DRAM corrected 99.997% of all logged errors. Because corrected errors are sampled while uncorrected errors are not, the chipkill codes on Blue Waters almost certainly corrected a much larger fraction of the total errors experienced by the system. More importantly, however, this type of analysis can lead to incorrect conclusions about the efficacy of ECC. For instance, using the authors' methodology, we find that Hopper's chipkill-detect ECC corrected 99.991% of all logged errors, while Cielo's chipkill-correct ECC corrected 99.806% of all logged errors. This implies that Hopper's ECC performs better than Cielo's ECC. However, Cielo's ECC actually had a 3x lower uncorrected error rate than that on Hopper, leading to the opposite conclusion: the chipkill-correct code on Cielo performs substantially better than the chipkill-detect code on Hopper.

Example 2. Our second example comes from a paper by Schroeder et al., in which the authors quote a memory error rate of 25,000-75,000 FIT/Mbit and claim that DRAM is orders of magnitude less reliable than previously thought [30]. The rate measured by the authors is the rate of logged errors, not the rate of faults [28]. Using this methodology, we find that Hopper's memory error rate was slightly more than 4x that of Cielo, again giving the impression that Hopper's DRAM is less reliable than Cielo's (see Figure 9). As stated in Section 6, however, Hopper has a fault rate of 25 FIT/device, compared to Cielo's 40 FIT/device [34], demonstrating that Hopper's DRAM is actually more reliable than Cielo's.

7.3 Importance of Counting Faults

The key observation in this section is that, in order to obtain an accurate picture of system reliability, it is imperative to analyze the error logs to identify individual faults, rather than treating each error event as separate. Error event counts: (a) are more indicative of the OS's polling frequency than of hardware health; and (b) overemphasize the effects of permanent faults, which can lead to thousands or millions of errors, while underreporting the effects of transient faults, which typically result in only a few errors but can be equally deleterious to system reliability. The sketch below makes the contrast concrete.
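This contrast can be illustrated with a small sketch (ours, using a fabricated log rather than real data): the same corrected-error log looks very different depending on whether we count raw error events or collapse them into per-device faults.

```python
# Hypothetical corrected-error log entries: (timestamp_hours, device_id).
# One permanent fault ("dimm7/chip3") is re-logged in many polling intervals;
# one transient fault ("dimm2/chip11") is logged only twice.
log = [(float(h), "dimm7/chip3") for h in range(0, 720, 6)]      # 120 events
log += [(100.0, "dimm2/chip11"), (100.1, "dimm2/chip11")]        # 2 events

error_count = len(log)            # 122 error events

# Collapse to faults: one fault per device, timestamped at its first error
# (the single-fault-per-device assumption from Section 5.1).
first_error = {}
for ts, dev in sorted(log):
    first_error.setdefault(dev, ts)
fault_count = len(first_error)    # 2 faults

print(error_count, fault_count)   # 122 vs. 2: error counts overweight the
                                  # permanent fault and hide the transient one.
```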
8. Analysis of Hardware Resilience Schemes

In this section, we use our data to examine various hardware resilience schemes in both DRAM and SRAM. Our goal is to understand the effectiveness of current hardware schemes. This data is useful for silicon and system designers who must consider a range of resilience techniques.

8.1 Comparing DRAM ECC Schemes

Many DRAM ECC schemes are used in the industry. The most common schemes are SEC-DED ECCs, chipkill-detect ECCs, and chipkill-correct ECCs [18]. SEC-DED ECC corrects single-bit errors and detects double-bit errors. Chipkill-detect ECC detects any error from a single DRAM device, while chipkill-correct ECC corrects any error from a single device. Prior studies show that chipkill-correct reduces the uncorrected error rate by 42x relative to SEC-DED ECC [33]. However, no prior study has quantified the difference in undetected errors between SEC-DED and chipkill.

Both Cielo and Hopper log the bits in error for each corrected error. By analyzing each error, we can determine whether a given error would be undetectable by SEC-DED ECC. For the purposes of this study, we assume that errors larger than 2 bits in a single ECC word are undetectable by SEC-DED (an actual SEC-DED ECC typically can detect some fraction of these errors). We call a fault that generates an error larger than 2 bits in an ECC word an undetectable-by-SECDED fault. A fault is undetectable-by-SECDED if it affects more than two bits in any ECC word, and the data written to that location does not match the value produced by the fault. For instance, writing a 1 to a location with a stuck-at-1 fault will not cause an error. Not all multi-bit faults are undetectable-by-SECDED faults. For instance, many single-column faults only affect a single bit per DRAM row [33], and manifest as a series of ECC words with a single-bit error that is correctable by SEC-DED ECC.

Figure 10 shows the rate of undetectable-by-SECDED faults on Cielo, broken down by vendor.

[Figure 10. Rate of faults that generate errors which are potentially undetectable by SEC-DED ECC.]

The rate of these faults exhibits a strong dependence on vendor. Vendor A has an undetectable-by-SECDED fault rate of 21.7 FIT per DRAM device. Vendors B and C, on the other hand, have much lower rates of 1.8 and 0.2 FIT/device, respectively. A Cielo node has 288 DRAM devices, so this translates to 6048, 518, and 57.6 FIT per node for vendors A, B, and C, respectively. This translates to one undetected error every 0.8 days, every 9.5 days, and every 85 days on a machine the size of Cielo.

Figure 10 also shows that 30% of the undetectable-by-SECDED faults on Cielo generated 2-bit errors (which are detectable by SEC-DED) before generating a larger (e.g., 3-bit) error, while the remaining 70% did not. We emphasize, however, that this detection is not a guarantee: different workloads would write different data to memory, and thus potentially exhibit a different pattern of multi-bit errors.

Our main conclusion from this data is that SEC-DED ECC is poorly suited to modern DRAM subsystems. The rate of undetected errors is too high to justify its use in very large-scale systems comprised of thousands of nodes where fidelity of results is critical. Even the most reliable vendor (vendor C) has a high rate of undetectable-by-SECDED faults when considering multiple nodes operating in parallel.
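A sketch of the classification step described above (ours; the record format is illustrative): given the corrected-error records attributed to one fault, flag the fault as undetectable-by-SECDED if any ECC word ever accumulated more than two bits in error.

```python
def undetectable_by_secded(error_records) -> bool:
    """Conservatively flag a fault that exceeds SEC-DED's detection guarantee.

    `error_records` is an iterable of (ecc_word_address, flipped_bit_positions)
    pairs belonging to a single fault. SEC-DED guarantees correction of any
    1-bit error and detection of any 2-bit error in a word; anything larger
    may go undetected.
    """
    bits_per_word = {}
    for word_addr, bits in error_records:
        bits_per_word.setdefault(word_addr, set()).update(bits)
    return any(len(bits) > 2 for bits in bits_per_word.values())

# A single-column fault touching one bit per ECC word remains detectable...
print(undetectable_by_secded([(0x100, {3}), (0x140, {3}), (0x180, {3})]))  # False
# ...while a fault that corrupts 3 bits of one ECC word is potentially silent.
print(undetectable_by_secded([(0x100, {3, 4, 5})]))                        # True
```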
Hopper and Cielo use chipkill-detect ECC and chipkill-correct ECC, respectively, and therefore exhibit much lower undetected error rates than if they used SEC-DED ECC.

8.2 DDR Command and Address Parity

A key feature of DDR3 (and now DDR4) memory is the ability to add parity-check logic to the command and address bus. The wires on this bus are shared from the memory controller to each DIMM's register. Therefore, errors on these wires will be seen by all DRAM devices on a DIMM. Though command and address parity is optional on DDR2 memory systems, we are aware of no prior study that examines the potential value of this parity mechanism.

The on-DIMM register calculates parity on the received address and command pins and compares it to the received parity signal. On a mismatch, the register signals a parity error to the memory controller. The standard does not provide for retry on a detected parity error, but requires the on-DIMM register to disallow any faulty transaction from writing data to the memory, thereby preventing data corruption.

The DDR3 subsystem in Cielo includes command and address parity checking. Figure 11 shows the rate of detected command/address parity errors relative to the rate of detected, uncorrected data ECC errors. The figure shows that the rate of command/address parity errors was 72% of the rate of uncorrected ECC errors.

[Figure 11. Rate of DRAM address parity errors relative to uncorrected data ECC errors.]

The conclusion from this data is that command/address parity is a valuable addition to the DDR standard. Furthermore, increasing DDR memory channel speeds will likely cause an increase in signaling-related errors. Therefore, we expect the ratio of address parity to ECC errors to increase with increased DDR frequencies.

8.3 SRAM Error Protection

In this section, we examine uncorrected errors from SRAM in Hopper and Cielo. Figure 12 shows the rate of SRAM uncorrected errors on Cielo in arbitrary units. The figure separately plots uncorrected errors from parity-protected structures and uncorrected errors from ECC-protected structures, and includes structures in the core, all caches, and a variety of non-core arrays and FIFOs. ECC-protected structures include the L1, L2, and L3 caches, which comprise the majority of the die area in the processor. The figure shows that the majority of uncorrected errors in Cielo come from parity-protected structures, even though these structures are far smaller than the ECC-protected structures.

[Figure 12. Rate of SRAM uncorrected errors in Cielo from parity- and ECC-protected structures.]

Parity can detect, but cannot correct, single-bit faults, while ECC will correct single-bit faults. Therefore, the majority of SRAM uncorrected errors in Cielo are the result of single-bit, rather than multi-bit, faults. Our primary conclusion in this section is that the best way to reduce SRAM uncorrected error rates is simply to extend single-bit correction (e.g., SEC-DED ECC) to additional structures in the processor. While this is a non-trivial effort due to performance, power, and area concerns, this solution does not require extensive new research or novel technologies. Once single-bit faults are addressed, the remaining multi-bit faults may be more of a challenge to address, especially in highly scaled process technologies in which the rate and spread of multi-bit faults may increase substantially [17]. In addition, novel technologies such as ultra-low-voltage operation will likely change silicon failure characteristics [37]. Current and near-future process technologies, however, will see significant benefit from increased use of on-chip ECC.
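The difference between the two protection levels can be seen in a toy sketch (ours): a single even-parity bit flags a one-bit upset but cannot locate it, and misses a two-bit upset entirely, whereas a SEC-DED code over the same word would correct the former and detect the latter.

```python
def even_parity(word: int) -> int:
    """Even-parity bit over a word: XOR of all its bits."""
    p = 0
    while word:
        p ^= word & 1
        word >>= 1
    return p

def parity_check(stored_word: int, stored_parity: int) -> str:
    """Parity can only say 'something is wrong'; it cannot locate the bad bit."""
    if even_parity(stored_word) != stored_parity:
        return "detected (uncorrectable)"
    return "no error seen"

data = 0b1011_0010
p = even_parity(data)
print(parity_check(data ^ 0b0000_0100, p))   # 1-bit flip -> detected (uncorrectable)
print(parity_check(data ^ 0b0011_0000, p))   # 2-bit flip -> no error seen (silent)
# A SEC-DED code over the same word would correct the first case and detect
# the second, which is why extending ECC into parity-only structures removes
# most of the uncorrected errors observed here.
```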
8.4 Analysis of SRAM Errors

In this section, we delve further into the details of the observed SRAM uncorrected errors, to determine the root cause and potential avenues for mitigation. Figure 13 shows the rate of uncorrected errors (in arbitrary units) from several representative SRAM structures in Hopper.

[Figure 13. Rate of SRAM uncorrected errors per structure in Hopper from cache data and tag arrays. Each Hopper socket contains 12 L1 instruction and data caches, 12 L2 caches, and 2 L3 caches.]

In the processors used in Hopper, the L1 data cache tag (L1DTag) is protected by parity, and all errors are uncorrectable. The L2 cache tag (L2Tag), on the other hand, is largely protected by ECC. However, a single bit in each L2 tag entry is protected by parity. Because the L2 cache is substantially larger than the L1, there are approximately half as many parity-protected bits in the L2 tag as there are bits in the entire L1 tag. This is reflected in the observed uncorrected error rate from the L2 tag, which is approximately half the rate of uncorrected errors from the L1 tag. (This L2 tag behavior has been addressed on more recent AMD processors.) Our observation from this data is that seemingly small microarchitectural decisions (e.g., the decision to exclude a single bit from ECC protection) can have a large impact on overall system failure rates. Therefore, detailed modeling and analysis of the microarchitecture (e.g., AVF analysis [25]) is critical to ensuring a resilient system design.

Prior field studies have shown that SRAM fault rates have a strong altitude dependence due to the effect of atmospheric neutrons [34]. True to this trend, Cielo experiences substantially more corrected SRAM faults than Hopper. While first principles would predict a similar increase in SRAM uncorrected errors at altitude, prior studies have not quantified whether this is true in practice. Figure 14 shows the per-bit SRAM uncorrected error rate in Cielo relative to Hopper in the L1 instruction tag, L2 tag, and L3 data arrays.

[Figure 14. Rate of SRAM uncorrected errors in Cielo relative to Hopper. ECC and aggressive interleaving seem to mitigate the expected increase in uncorrected error rate due to altitude.]

The figure shows that Cielo's error rate in the L2 and L3 structures is not substantially higher than Hopper's error rate, despite Cielo being at a significantly higher altitude. We attribute this to the error protection in these structures. These structures have both ECC and aggressive bit interleaving; therefore, a strike of much larger than two bits is required to cause an uncorrected error. The data therefore point to the conclusion that the multi-bit error rate from high-energy neutrons in these structures is smaller than the error rate from other sources, such as alpha particles and a variety of hard fault mechanisms. Thus, the data in this section suggest that appropriate error protection mechanisms can successfully offset the increase in raw fault rate due to altitude. We conclude that an appropriately-designed system need not be less reliable when located at higher elevations.

9. Lessons for Reliable System Design

There are several lessons about reliable system design to be gleaned from our study. We believe this study both confirms and rebuts many widely-held assumptions and also provides new insights valuable to system designers and researchers.

• Faults are unpredictable. Some faults may occur much more often than expected (e.g., DDR address parity) and some fault modes may not be known at design time (e.g., DRAM disturbance faults). Therefore, providing a robust set of error detectors is key.
• Details matter. For instance, simply describing a cache as "ECC-protected" does not convey adequate information, since excluding even a single bit from this protection can have a large negative effect on system reliability. Careful analysis and modeling (e.g., AVF analysis) are required to predict the expected failure rate of any device.

• The ability to diagnose is critical. This includes both ensuring that hardware logs have adequate diagnosis information as well as having access to appropriate software tools. For example, knowing exactly which bits are in error is critical to understanding fault modes and determining what went wrong. Adding this level of diagnosability to the hardware can require a much larger investment of time and effort than adding ECC to a structure in the first place.

• Analysis is tricky. Qualitatively, our experience is that this type of field study is difficult to perform correctly. These studies require understanding the details of hardware and software behavior (some of which may be undocumented), mining extremely large data sets to separate the signal from the noise, and carefully interpreting the results. Quantitatively, our data show that not following these steps can lead to incorrect conclusions about system reliability.

• Scale changes everything. In large systems, even very small structures require protection due to the sheer number of components in a modern-day supercomputer or data center. Components that neglect this type of protection run the risk of corrupting data at non-trivial rates.

10. Projections to Future Systems

In this section, we examine the impact of DRAM and SRAM faults on a potential exascale supercomputer. Our goal is to understand whether existing reliability mechanisms can cope with the challenges presented by likely increases in system capacity and fault rates for future large-scale systems and data centers. To accomplish this, we scale our observed DRAM and SRAM error rates to likely exascale configurations and system sizes. We also model the impact of different process technologies such as FinFET transistors [15]. All analysis in this section refers to detected errors. Our projections are for currently-existing technologies and do not consider the impact of new technologies such as die-stacked DRAM, ultra-low-voltage CMOS, or NVRAM.

10.1 DRAM

Exascale supercomputers are predicted to have between 32PB and 128PB of main memory. Due to capacity limitations in die-stacked devices [32], much of this memory is likely to be provided in off-chip memory.
The interface to these off-chip devices will likely resemble current DDR memory interfaces. Therefore, it is critical to understand the reliability requirements for these off-chip memory subsystems.

Prior work has shown that DRAM vendors maintain an approximately constant fault rate per device across technology generations, despite reduced feature sizes and increased densities [9]. Therefore, we expect that the per-device fault rates in an exascale computer will be similar to those observed in today's DRAM devices. Our goal is to determine whether current error-correcting codes (ECCs) will suffice for an exascale supercomputer. Therefore, we assume that each memory channel in an exascale system will use the same single-chipkill code in use on Cielo. As a result, we project that the per-device uncorrected error rate in an exascale supercomputer will be the same as the per-device error rate in Cielo. Because we do not know the exact DRAM device capacity that will be in mass production in the exascale timescale, we sweep the device capacity over a range from 8Gbit to 32Gbit devices. A larger DRAM device can deliver a specified system memory capacity with fewer total devices.

Figure 15 shows the results of this analysis. The figure plots the system-wide DRAM uncorrected error rate for an exascale system relative to the DRAM uncorrected error rate on Cielo. The figure plots two per-device error rates at each point, each taken from field data on different systems.

[Figure 15. Rate of DRAM uncorrected errors in an exascale supercomputer relative to Cielo.]

The figure shows that the uncorrected error rate for an exascale system ranges from 3.6 times Cielo's uncorrected error rate at the low end to 69.9 times Cielo's uncorrected error rate at the high end. At a system level, the increase in DRAM uncorrected error rates at the high end of memory capacity is comparable to the 40x reduction in DRAM uncorrected errors achieved when upgrading from SEC-DED ECC to chipkill ECC [33]. If we assume that SEC-DED ECC provides insufficient reliability for today's memory subsystems, our conclusion from these results is that higher-capacity exascale systems may require stronger ECC than chipkill.
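The projection itself is essentially a device-count scaling exercise. A simplified sketch (ours): holding the per-device uncorrected error rate fixed at Cielo's, the system-wide rate scales with the number of DRAM devices needed to reach a given capacity. It ignores the high/low per-device FIT variation that Figure 15 sweeps and lumps data and check devices together, so it only reproduces the general magnitude of the projection.

```python
# Simplified projection sketch (ours). Cielo baseline: ~8,500 nodes with
# 288 DRAM devices per node (both figures from the paper).
CIELO_DEVICES = 8500 * 288

def relative_uncorrected_rate(system_capacity_pb: float, device_gbit: float) -> float:
    """Exascale system-wide DRAM uncorrected error rate relative to Cielo,
    assuming the same chipkill code and the same per-device error rate."""
    gbits_needed = system_capacity_pb * 8e6          # 1 PB = 8e6 Gbit
    devices = gbits_needed / device_gbit             # check devices ignored
    return devices / CIELO_DEVICES

# 32PB built from 32Gbit devices vs. 128PB built from 8Gbit devices:
print(round(relative_uncorrected_rate(32, 32), 1))    # ~3.3x Cielo
print(round(relative_uncorrected_rate(128, 8), 1))    # ~52.3x Cielo
# Same order as the 3.6x-69.9x range in Figure 15, which additionally varies
# the assumed per-device error rate.
```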
10.2 SRAM

We now turn our attention to SRAM failures on future systems. A socket in Cielo contains 18MB of SRAM in the L2 and L3 cache data arrays. Exascale-class processors are projected to see a substantial increase in processing power per socket, and thus will contain significantly more SRAM per node. Taking into account technology trends and reported structure sizes for CPU and GPU processors [3], we project that an exascale socket will contain over 150MB of SRAM, an 8-10x increase in SRAM per socket over current supercomputers. The increase in SRAM relative to today's systems is less than the increase in DRAM because of the switch to general-purpose graphics processing units (GPGPUs) and/or GPGPUs integrated with CPU cores in an accelerated processing unit (APU), both of which rely less on large SRAM caches than traditional CPU cores.

Assuming that SRAM fault rates remain constant in future years, the per-socket fault rates will increase linearly with SRAM capacity. Therefore, an exascale processor will see 8-10x the number of SRAM faults experienced by a current processor, and potentially 8-10x the rate of uncorrected SRAM errors. This translates to a system-level uncorrected error rate from SRAM errors of 50-100 times the SRAM uncorrected error rate on Cielo, depending on the system size. This is shown in the first group of bars of Figure 16, labeled Small for a low node-count system (50k nodes) and Large for a high node-count system (100k nodes).

[Figure 16. Rate of SRAM uncorrected errors relative to Cielo in two different potential exascale computers composed of CPU and GPU/APU nodes.]

Per-bit SRAM transient fault rates have trended downwards in recent years [17]. If this trend continues, the SRAM uncorrected error rate per socket will be lower than our projections. For instance, according to Ibe et al., the per-bit SER decreased by 62% between 45nm and 22nm technology. If we see a corresponding decrease between current CMOS technologies and exascale technologies, the exascale system SRAM uncorrected error rate decreases to 19-39 times the uncorrected error rate on Cielo, shown in the second group of bars of Figure 16. Finally, as noted previously, the majority of uncorrected SRAM errors are due to single-bit faults. If these errors are eliminated in an exascale processor (e.g., by replacing parity with ECC), the exascale system SRAM uncorrected error rate would be only 3-6.5 times Cielo's uncorrected error rate, shown in the third group of bars in Figure 16.

Our conclusion from this analysis is that vendors should focus aggressively on reducing the rate of uncorrected errors from SRAM faults. However, large potential reductions are available through reductions in SER due to technology scaling, and much of the remainder may be possible through expanding correction of single-bit faults. Once practical limits are reached, however, more advanced techniques to reduce the rate of multi-bit faults may be needed.

11. Summary

Reliability will continue to be a significant challenge in the years ahead. Understanding the nature of faults experienced in practice can benefit all stakeholders, including processor and system architects, data center operators, and even application writers, in the quest to design more resilient large-scale data centers and systems. In this paper, we presented data on DRAM and SRAM faults, quantified the impact of several hardware resilience techniques, and extracted lessons about reliable system design. Our findings demonstrate that, while systems have made significant strides over the years (e.g., moving from SEC-DED ECC to chipkill on the DRAM subsystem), there is clearly more work to be done in order to provide a robust platform for future large-scale computing systems.

Acknowledgments

We thank R. Balance and J. Noe from Sandia, and K. Lamb and J. Johnson from Los Alamos, for information on Cielo, and T. Butler and the Cray staff at NERSC for data and information from Hopper. AMD, the AMD Arrow logo, AMD Opteron, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
Our conclusion from this analysis is that vendors should focus aggressively on reducing the rate of uncorrected errors from SRAM faults. However, large potential reductions are available through reductions in SER due to technology scaling, and much of the remainder may be possible through expanding correction of single-bit faults. Once practical limits are reached, however, more advanced techniques to reduce the rate of multi-bit faults may be needed.

11. Summary

Reliability will continue to be a significant challenge in the years ahead. Understanding the nature of faults experienced in practice can benefit all stakeholders, including processor and system architects, data center operators, and even application writers, in the quest to design more resilient large-scale data centers and systems. In this paper, we presented data on DRAM and SRAM faults, quantified the impact of several hardware resilience techniques, and extracted lessons about reliable system design. Our findings demonstrate that, while systems have made significant strides over the years (e.g., moving from SEC-DED ECC to chipkill on the DRAM subsystem), there is clearly more work to be done to provide a robust platform for future large-scale computing systems.

Acknowledgments

We thank R. Balance and J. Noe from Sandia, and K. Lamb and J. Johnson from Los Alamos for information on Cielo, and T. Butler and the Cray staff at NERSC for data and information from Hopper. AMD, the AMD Arrow logo, AMD Opteron, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

References

[1] Flux calculator. http://seutest.com/cgi-bin/FluxCalculator.cgi.
[2] mcelog: Memory error handling in user space. http://halobates.de/lk10-mcelog.pdf.
[3] AMD. AMD graphics cores next (GCN) architecture. http://www.amd.com/us/Documents/GCN_Architecture_whitepaper.pdf.
[4] AMD. BIOS and kernel developer guide (BKDG) for AMD family 10h models 00h-0fh processors. http://developer.amd.com/wordpress/media/2012/10/31116.pdf.
[5] AMD. AMD64 architecture programmer's manual volume 2: System programming, revision 3.23. http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/24593_APM_v21.pdf, 2013.
[6] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1):11–33, Jan.-Mar. 2004.
[7] R. Baumann. Radiation-induced soft errors in advanced semiconductor technologies. IEEE Transactions on Device and Materials Reliability, 5(3):305–316, Sept. 2005.
[8] K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick. Exascale computing study: Technology challenges in achieving exascale systems. Peter Kogge, editor & study lead, Sep. 2008.
[9] L. Borucki, G. Schindlbeck, and C. Slayman. Comparison of accelerated DRAM soft error rates measured at component and system level. In IEEE International Reliability Physics Symposium (IRPS), pages 482–487, 2008.
[10] C. Constantinescu. Impact of deep submicron technology on dependability of VLSI circuits. In International Conference on Dependable Systems and Networks (DSN), pages 205–209, 2002.
[11] C. Constantinescu. Trends and challenges in VLSI circuit reliability. IEEE Micro, 23(4):14–19, Jul.-Aug. 2003.
[12] C. Di Martino, Z. Kalbarczyk, R. K. Iyer, F. Baccanico, J. Fullop, and W. Kramer. Lessons learned from the analysis of system failures at petascale: The case of Blue Waters. In International Conference on Dependable Systems and Networks (DSN), pages 610–621, 2014.
[13] A. Dixit, R. Heald, and A. Wood. Trends from ten years of soft error experimentation. In IEEE Workshop on Silicon Errors in Logic - System Effects (SELSE), 2009.
[14] N. El-Sayed, I. A. Stefanovici, G. Amvrosiadis, A. A. Hwang, and B. Schroeder. Temperature management in data centers: Why some (might) like it hot. In International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), pages 163–174, 2012.
[15] X. Huang, W.-C. Lee, C. Kuo, D. Hisamoto, L. Chang, J. Kedzierski, E. Anderson, H. Takeuchi, Y.-K. Choi, K. Asano, V. Subramanian, T.-J. King, J. Bokor, and C. Hu. Sub 50-nm FinFET: PMOS. In International Electron Devices Meeting (IEDM), pages 67–70, 1999.
[16] A. A. Hwang, I. A. Stefanovici, and B. Schroeder. Cosmic rays don't strike twice: Understanding the nature of DRAM errors and the implications for system design. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 111–122, 2012.
[17] E. Ibe, H. Taniguchi, Y. Yahagi, K.-i. Shimbo, and T. Toba. Impact of scaling on neutron-induced soft error in SRAMs from a 250 nm to a 22 nm design rule. IEEE Transactions on Electron Devices, pages 1527–1538, Jul. 2010.
[18] X. Jian, H. Duwe, J. Sartori, V. Sridharan, and R. Kumar. Low-power, low-storage-overhead chipkill correct via multi-line error correction. In International Conference on High Performance Computing, Networking, Storage and Analysis (SC), pages 24:1–24:12, 2013.
[19] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, and O. Mutlu. Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors. In International Symposium on Computer Architecture (ISCA), pages 361–372, 2014.
[20] X. Li, M. C. Huang, K. Shen, and L. Chu. A realistic evaluation of memory hardware errors and software system susceptibility. In USENIX Annual Technical Conference (USENIX ATC), pages 6–20, 2010.
[21] X. Li, K. Shen, M. C. Huang, and L. Chu. A memory soft error measurement on production systems. In USENIX Annual Technical Conference (USENIX ATC), pages 21:1–21:6, 2007.
[22] P. W. Lisowski and K. F. Schoenberg. The Los Alamos Neutron Science Center. Nuclear Instruments and Methods, 562(2):910–914, June 2006.
[23] T. May and M. H. Woods. Alpha-particle-induced soft errors in dynamic memories. IEEE Transactions on Electron Devices, 26(1):2–9, Jan. 1979.
[24] A. Messer, P. Bernadat, G. Fu, D. Chen, Z. Dimitrijevic, D. Lie, D. Mannaru, A. Riska, and D. Milojicic. Susceptibility of commodity systems and software to memory soft errors. IEEE Transactions on Computers, 53(12):1557–1568, Dec. 2004.
[25] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin. A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor. In International Symposium on Microarchitecture (MICRO), pages 29–40, 2003.
[26] J. T. Pawlowski. Memory errors and mitigation: Keynote talk for SELSE 2014. In IEEE Workshop on Silicon Errors in Logic - System Effects (SELSE), 2014.
[27] H. Quinn, P. Graham, and T. Fairbanks. SEEs induced by high-energy protons and neutrons in SDRAM. In IEEE Radiation Effects Data Workshop (REDW), pages 1–5, 2011.
[28] B. Schroeder. Personal communication.
[29] B. Schroeder and G. Gibson. A large-scale study of failures in high-performance computing systems. In International Conference on Dependable Systems and Networks (DSN), pages 249–258, 2006.
[30] B. Schroeder, E. Pinheiro, and W.-D. Weber. DRAM errors in the wild: A large-scale field study. Communications of the ACM, 54(2):100–107, Feb. 2011.
[31] T. Siddiqua, A. Papathanasiou, A. Biswas, and S. Gurumurthi. Analysis of memory errors from large-scale field data collection. In IEEE Workshop on Silicon Errors in Logic - System Effects (SELSE), 2013.
[32] J. Sim, G. H. Loh, V. Sridharan, and M. O'Connor. Resilient die-stacked DRAM caches. In International Symposium on Computer Architecture (ISCA), pages 416–427, 2013.
[33] V. Sridharan and D. Liberty. A study of DRAM failures in the field. In International Conference on High Performance Computing, Networking, Storage and Analysis (SC), pages 76:1–76:11, 2012.
[34] V. Sridharan, J. Stearley, N. DeBardeleben, S. Blanchard, and S. Gurumurthi. Feng shui of supercomputer memory: Positional effects in DRAM and SRAM faults. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 22:1–22:11, 2013.
[35] A. N. Udipi, N. Muralimanohar, R. Balasubramonian, A. Davis, and N. P. Jouppi. LOT-ECC: Localized and tiered reliability mechanisms for commodity memory systems. In International Symposium on Computer Architecture (ISCA), pages 285–296, 2012.
[36] J. Wadden, A. Lyashevsky, S. Gurumurthi, V. Sridharan, and K. Skadron. Real-world design and evaluation of compiler-managed GPU redundant multithreading. In International Symposium on Computer Architecture (ISCA), pages 73–84, 2014.
[37] C. Wilkerson, A. R. Alameldeen, Z. Chishti, W. Wu, D. Somasekhar, and S.-L. Lu. Reducing cache power with low-cost, multi-bit error-correcting codes. In International Symposium on Computer Architecture (ISCA), pages 83–93, 2010.