
Single event upset at ground level

1996, IEEE Transactions on Nuclear Science

Ground level upsets have been observed in computer systems containing large amounts of random access memory (RAM). Atmospheric neutrons are most likely the major cause of the upsets based on measured data using the Weapons Neutron Research (WNR) neutron beam.

IEEE TRANSACTIONS ON NUCLEAR SCIENCE, VOL. 43, NO. 6, DECEMBER 1996

Single Event Upset at Ground Level

Eugene Normand, Member, IEEE
Boeing Defense & Space Group, Seattle, WA 98124-2499

0018-9499/96$05.00 © 1996 IEEE

Abstract

Ground level upsets have been observed in computer systems containing large amounts of random access memory (RAM). Atmospheric neutrons are most likely the major cause of the upsets based on measured data using the Weapons Neutron Research (WNR) neutron beam.

I. INTRODUCTION

Several years after single event upset (SEU) was discovered in space in 1975, J. Ziegler [1] noted the potential for microelectronics on the ground to be susceptible to SEU from cosmic ray secondaries, primarily neutrons. Ziegler's work was prompted by the work of T. May and M. Woods [2] in uncovering errors in RAM chips due to upsets caused by the alpha particles released by U and Th contaminants within the chip packaging material. The alpha problem was regarded seriously and chip vendors took specific actions to reduce it to tolerable levels, mainly by reducing the alpha particle flux emitted by packaging and processing materials to generally < 0.01 α/cm²-hr [3].

Unfortunately, the potential for cosmic rays causing SEU on the ground received little attention, and has received almost no public recognition on the part of chip vendors. Very recently, IBM revealed that beginning in 1979, they undertook a very large proprietary effort to understand the phenomenon of upsets at ground level. This 15-year effort involved many different disciplines and activities: field testing of memories, accelerated testing using cyclotron beams, detailed model development on all levels, environmental monitoring and coordination with device designers [4]. In contrast to the lack of recognition of the key role played by cosmic radiation for ground level upsets, the importance of this mechanism was recognized by people dealing with avionics, i.e., electronics in aircraft, relatively early in the open literature. Avionics SEU by the atmospheric neutrons was first predicted in 1984 [5] and later rigorously demonstrated to occur in flight in 1992 [6].

II. GROUND LEVEL NEUTRON FLUX

The neutron environment at ground level can be defined in terms of the models for the atmospheric neutron flux at higher altitudes which are mainly based on neutrons in the energy range of 1 < E < 10 MeV [7]. A number of studies have shown that the shape of the energy spectrum of the neutron flux doesn't change with altitude or latitude, although its absolute magnitude does vary with the location and altitude around the earth. Limited data from a sophisticated ground-based detector system made at 100, 5000 and 10,000 feet above sea level indicate that the 10-100 MeV flux falls off approximately linearly with altitude [8].

Very few measurements of the neutron spectrum at ground level have been made, especially over the entire energy range. One set of the most recent terrestrial spectral measurements, made in Japan [9], was normalized to obtain the neutron spectrum expected in the US, based on scaling airplane spectral measurements made over Japan [9] and the US [10]. These spectra show that the ground spectrum is roughly 1/300 of that at 40,000 ft.

III. SINGLE EVENT UPSETS AT GROUND LEVEL

There is considerable evidence of upsets on the ground, but it has been largely kept proprietary or else it has been in the hands of computer systems engineers who do not understand its meaning or implications. In the following paragraphs we will present various examples of this kind of data, including reference to the very recently revealed vast storehouse of data obtained by IBM over a 15-year period via a well-coordinated proprietary effort. In addition, five specific examples will be cited: one from a very large computer system that was taken off line for testing, two from the error log/maintenance history of a collection of large computers, one from a biomedical device utilizing SRAMs that has been implanted in hundreds of patients, and one from the system soft error FIT rate (failures in time, i.e., failures per 10^9 device hours) testing performed by RAM vendors.

In addition, we believe that there are extensive collections of other data that provide evidence of these upsets, e.g., in the error and/or maintenance logs of large computer systems. In particular, the error logs of computer systems located in high altitude cities, such as in the Rocky Mountain region, are expected to reveal many such upsets. Although at present such records have not yet been made public, we hope that with the publication of this work, other SEU workers will work cooperatively with computer systems people within their organizations to uncover and reveal the large compilations of errors that exist. These errors have been detected, corrected and logged by the dedicated software and hardware within those computer systems, so the computer systems engineers are satisfied that their systems are well protected. However, the EDAC (error detection and correction) systems that work so effectively in protecting the large computer systems can also reveal the mystery of those upsets to SEU researchers who understand the mechanisms causing the errors.

III.A EARLY IBM STUDY

An early study showed that when a large number of memories was monitored for single event upset at three locations of varying altitude (5000 feet, sea level and in a mine), the upset rate decreased with decreasing elevation, indicating that atmospheric neutrons are the likely cause [11]. This study has been recently published in a much updated format [12, 13] that carefully separates out the upsets caused by alpha particles emitted by trace elements in the device package from those caused by the atmospheric neutrons.
Using the atmospheric upset rate component at three locations within the US, the variation with altitude is the same as the atmospheric neutron flux variation with altitude [12, 13]. The very recently issued special edition of the IBM Journal of Research and Development (entirely devoted to the subject of ground level upsets) has a great deal of additional information on the many similar proprietary tests that IBM performed. The results of most of those tests are, however, presented in a relative or normalized format. In those instances in which we can infer absolute error rates, that data will be utilized (see discussion of FIT rates and Table 2 below).

III.B UPSET RATE IN FERMILAB COMPUTER SYSTEM

The computer system ACPMAPS at Fermilab is a very large system of individual computers, which when joined together, contains about 160 Gbits of DRAM memory [14]. The ACPMAPS is housed in a computer building far removed from the very high energy Fermilab accelerators. It contains 156 Gbits of 4 Mbit fast page-mode DRAM, guarded by parity but not protected by EDAC. In production it consistently experiences single bit errors on an almost daily basis. When the entire system was taken off-line for testing, it routinely gave an upset rate of 2.5 upset/day or 7E-13 upset/bit-hr.

It did not appear that these errors were being caused by alphas in the packaging material. First, the rate observed was 5-10 times larger than that which could be inferred from the results of the manufacturers' non-accelerated failure tests, and more than 500 times larger than the FIT rate based on extrapolating from accelerated failure tests with an alpha source. Second, the chip vendor indicated that, based on lab tests with alpha sources, almost all alpha-induced upsets in these DRAMs occur when a "page miss" (a change in the row address) causes 4K bits of data to move from the DRAM cells to a small on-chip SRAM page. The window of vulnerability occurs when the long lines to the DRAM cells are active, so the error rate should be proportional to the rate of page misses (plus refreshes). Contrary to this, Fermilab found that the 2.5 upset/day rate was independent of the rate of page misses, which was varied by over a factor of ten. Finally, as May and Woods showed [2], the alpha induced upset rate is extremely sensitive to critical charge, Qc, the charge that has to be deposited in a device to "flip" a logic state, e.g., 0→1 [1] (factor of > 100 reduction in the rate for a doubling of the Qc value), whereas with neutrons and the recoils they produce, it is much more gradual. The Fermilab system contains DRAMs from two different manufacturers (and therefore, almost certainly, with different Qc values) and yet these showed no significant difference in upset rate. Other large computer systems with different DRAMs, including workstation clustered "computer farms" at Fermilab, also exhibit about the same upset/bit-hr rate as observed for ACPMAPS. The observed upset rate in the DRAMs of the ACPMAPS is much more consistent with the SEUs being caused by the atmospheric neutrons rather than packaging material alphas, as will be shown below.

III.C UPSET RATES IN LARGE COMPUTER SYSTEMS

An increasing number of off-the-shelf computers, in the workstation and larger classes, are being designed to incorporate EDAC to protect the RAM from errors. One such model is the Nite Hawk computer. Each Nite Hawk has approximately 1 Gbit of DRAM memory, apportioned between global and local memory. Many of these computers have been used in a local systems integration laboratory, where the computer vendor also has the job of performing monthly maintenance on the machines. An informal assessment by the computer maintenance people is that on the average, each machine shows one upset (parity error) per month, with some having two errors and some none. Using the average value of one error per month (defined as 624 hours), this is equivalent to a ground level upset rate of 1.6E-12 upset/bit-hr.

A more accurate measure of the error rate was obtained based on a small number of errors from the error logs, acquired over a few-month period of time. The SYSERR logs of five Nite Hawk 5800 computers were checked; four are simulation computers and the fifth is a development computer. The logs for the four simulation computers covered about 4 months, while that for the developmental computer covered seven months. The four simulation computers experienced 0, 1, 2 and 3 errors respectively; two of the six errors were in global memory and four in local memory. The amount of total memory available in these four computers varies. All have 64 Mbytes of global memory, two have 160 Mbytes of local memory and two have 256 Mbytes of local memory. Thus two machines have available 1.8 Gbits and two have 2.6 Gbits of memory. At present on average the memory usage on the simulation computers is estimated to be approximately 50%. This leads to an upset rate of 2.5E-12 upset/bit-hr.

The developmental computer appears to be run on a more consistent basis. Its error log covered a time period of ~30 weeks. This computer has 64 Mbytes in global memory and 64 Mbytes in local memory for a total of 1 Gbit. For this machine an 80% usage factor for the total memory was estimated, and over that time period 2 errors (one in global memory and one in local memory) were encountered. The error rate for the developmental computer is thus 1.7E-12 upset/bit-hr. A more representative error rate was obtained by averaging the error rates of all five of the Nite Hawk computers, and this works out to 2.3E-12 upset/bit-hr.
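The per-bit rates quoted for the Fermilab system and for the informal Nite Hawk estimate follow from dividing the observed error count by the bit-hours of exposure. The following is a minimal Python sketch of that arithmetic, not the author's procedure: the helper name `upsets_per_bit_hour` is hypothetical, and 1 Gbit is assumed to mean 10^9 bits.

```python
def upsets_per_bit_hour(errors, bits, hours, utilization=1.0):
    """Observed errors divided by exposed bit-hours.

    `utilization` is the fraction of memory assumed to be in use
    (an assumption for this sketch; 1.0 means all bits are exposed).
    """
    return errors / (bits * utilization * hours)


GBIT = 1e9  # assumption: decimal gigabit

# Fermilab ACPMAPS: 2.5 upsets per day over 156 Gbit of DRAM
fermilab = upsets_per_bit_hour(errors=2.5, bits=156 * GBIT, hours=24)

# Nite Hawk informal estimate: 1 error per 624-hour month, about 1 Gbit
nite_hawk = upsets_per_bit_hour(errors=1, bits=1 * GBIT, hours=624)

print(f"Fermilab:  {fermilab:.1e} upset/bit-hr")
print(f"Nite Hawk: {nite_hawk:.1e} upset/bit-hr")
```

Under these assumptions the two results round to the 7E-13 and 1.6E-12 upset/bit-hr figures given in the text. The `utilization` argument is included because the later estimates in the paper scale the exposed memory by an estimated usage fraction.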
A second independent source of upset data is the Cray YMP-8 located about ten miles away from the Nite Hawk computers. The main memory of the YMP-8 consists of 32 modules, each with 256 Mbits of SRAM, for a total of ~8.2 Gbits of SRAM. Each module comprises one thousand 256K x 1 SRAMs. The system is protected by a standard EDAC system known as SECDED (single error correct, double error detect). Upsets are found only during the read operation. SECDED is implemented by having a Hamming code generated on every write operation. On every read operation the 72 bit word (64 bits comprising the word, 8 extra Hamming code bits) is again checked by the error detection circuit. If a single bit is off, the bit is corrected and the error logged; if a double bit error is found, no correction is attempted, but it is logged and flagged and the entire module is replaced. The new Cray Triton T-94 system uses a double error correct, multiple error detect (DECMED) system employing 12 check bits so that double bit errors can be corrected. It uses 2M x 2 SRAMs to comprise the memory in its modules in very compact memory stacks.

We were able to gain access to the system error logs for this Cray YMP-8 covering a period of 22 months (May 1992 - February 1994). Over that time period, 30 out of the 32 modules experienced one or more parity errors. During the first 16 months the parity errors were logged and date stamped, but this changed in August 1993, after which the errors were logged but without a date stamp. To extract individual upset data required careful interpretation of the error logs. This was made easier through the assistance of the systems engineer, but it also required several assumptions to be made in order to interpret the data in a consistent manner. Two examples of errors that were not counted as random errors are illustrative: 1) errors in "flaky" RAM chips (defined as a chip that had the same error at the same location on 2 or more days over the 22 months) and 2) the large number of single bit errors that occurred on Oct. 11, 1992 in four modules during preventative maintenance (PM), because the PM-induced errors were registered in the modules in which the EDAC diagnostic wasn't turned off.

The parity error data was converted into a distribution of parity errors per module. This distribution function is shown in Fig. 1 and is normalized to the errors occurring on an annual basis. Since it is set up as a cumulative distribution, we see that 50% of the modules will have 1.8-2 or more errors per module-year. This is consistent with the mean number of errors, which is 2.3 error/module-yr. The figure also shows the theoretical cumulative probability function for a Poisson distribution based on a mean rate of 2.3 error/module-yr. The generally good agreement between the Poisson distribution and the actual upset data indicates that the source of most of the errors is random, such as SEUs produced by the atmospheric neutrons. It also indicates that the high error rates (> 8 error/module-yr) in two of the modules may be due to more than random error. The distribution also shows that the 10% most error-prone modules will be experiencing at least 6 error/module-yr.

Figure 1. The Cumulative Distribution Function for Ground Level Errors (Error/Module-Year) in the Main Memory of the CRAY YMP-8. [Plot not reproduced; horizontal axis: Errors/Module-Yr, 0 to 10.]

The utilization factor of the main memory is about 80%, which has to be used to obtain a meaningful bit error rate that can be compared to the rate in other systems. Using the mean upset rate of 2.3 error/module-yr (133 total upsets), this converts to a bit error rate of 1.3E-12 upset/bit-hr.

In addition to the main memory, the Cray also has a secondary bulk storage memory system called the Solid State Device (SSD). The SSD consists of a total of 32 Gbits of DRAM, in this case all in the form of 4M x 1 DRAMs. It too is protected by EDAC. The error logs from the SSD were studied for the same 22-month period of time and it was found that the average number of errors was 2.7/month for a total of 60 errors. The utilization factor for the SSD is lower than for the main memory, with a value of 20% being a rough estimate. Therefore the 2.7 error/month converts to a bit error rate of 6E-13 upset/bit-hr.

The DRAMs also exhibited double bit errors, a total of 17 or ~28% of all the errors. However, 10 of the 17 double bit errors occurred during only two of the months. The number of single bit errors for those two months was high but similar to the number of single bit errors during several other months, and the number of errors in the main memory during those two months was about average. Thus, although it is unclear why so many double bit errors occurred during those two months, it appears that the memory usage in the SSD may have been much higher than usual during those periods.

III.D UPSET RATES FROM FIT RATE TESTS BY RAM VENDORS

RAM manufacturers typically perform two types of quality control tests at their facilities in which they record the bit error rate, the rate being given in FIT units: a) system soft error rate (SSER), by monitoring 1000 parts for 1000 hours, and b) accelerated SER (ASER), obtained by using a radiation source. Historically, RAM vendors have used alpha sources to conduct their accelerated tests. The use of alpha sources for these tests goes back to the early problem of alpha contaminants in the chip packaging causing upsets [2], and it was standardized in terms of a test procedure [16].
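The SSER test size quoted above (1000 parts for 1000 hours) fixes how coarse a FIT measurement can be. Since a FIT is one failure per 10^9 device-hours, such a test accumulates only 10^6 device-hours. A small Python sketch of that relationship follows; the function names are illustrative only and the 2000 FIT input is an example value, not a measured result.

```python
FIT_DEVICE_HOURS = 1e9  # one FIT = one failure per 1e9 device-hours


def fit_from_test(upsets, parts, hours):
    """FIT rate implied by an upset count observed in a field (SSER) test."""
    return upsets * FIT_DEVICE_HOURS / (parts * hours)


def expected_upsets(fit, parts, hours):
    """Mean number of upsets expected for a part type with the given FIT rate."""
    return fit * parts * hours / FIT_DEVICE_HOURS


# One upset seen in a 1000-part, 1000-hour test
print(fit_from_test(upsets=1, parts=1000, hours=1000))

# Upsets expected from a hypothetical 2000 FIT part in the same test
print(expected_upsets(fit=2000, parts=1000, hours=1000))
```

Each observed upset in such a test corresponds to 1000 FIT, so a part in the low thousands of FIT yields only a handful of upsets, and the statistical uncertainty on the measured rate is correspondingly large.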
However, use of the alpha source does not provide an accurate indication of the ground level upset rate, as Lage [17] directly showed by comparing SSER and alpha-source ASER rates. Three other types of ASER testing have been proposed and used: a) proton beams to simulate neutron-induced upset (IBM [18]), b) the WNR neutron spallation source at Los Alamos (TI [19] and Boeing [20]) and c) a 14 MeV neutron generator (used in conjunction with a calculational method by Boeing [21]).

Examination of SSER FIT rates provides an excellent method of inferring the ground level upset rate. Unfortunately, few such measurements have been published. Those that are available are listed in Table 1, which contains data from tests conducted by Motorola [16] and by IBM [17]. The upset rates are presented in terms of the FIT rate as well as the per bit rate. All of the Motorola data, which are for various types of Motorola SRAMs, are from SSER tests, and this data has error bars to indicate the poor statistics involved (typically very few errors, e.g., < 5, are measured). The IBM data includes measurements from both SSER (field) and ASER (proton beam) tests on 1M and 4M DRAMs. Two factors are to be noted with the IBM upset data: each of the averaged FIT rates is for a DRAM from a different vendor, and there are no error bars indicative of the upset statistics. We note a wide variation in the upset rate among the various DRAM devices and significant differences between the SSER and ASER results, which is not typical of the measurements in many of their other tests. Nevertheless, taken as a composite, the ground level RAM upset rates listed in Table 1 are relatively consistent, mainly in the range of 1-2 E-12 upset/bit-hr, and are therefore similar to the ground level upset rates measured in the large computer systems discussed above.

IV. ANALYSIS

In summary, five different sources of ground level upset rates in RAM devices have been discussed. These are tabulated in Table 2.
The upset rates agree with one another within less than an order of magnitude, and a rate in the range of 1-2 E-12 upset/bit-hr appears to be about average. Thus the simple average value of 1.5E-12 upset/bit-hr represents the entire range of rates, 0.3-2.3 E-12 upset/bit-hr, for both DRAMs and SRAMs, from the diverse sources of data. Our hypothesis is that the great majority of these upsets are caused by the atmospheric neutrons, i.e., the cosmic ray secondaries at ground level. To demonstrate this, we will tabulate SEU measurements made on both SRAMs and DRAMs that were tested in the WNR neutron beam at the Los Alamos National Laboratory. As we have previously shown [22], the WNR neutron spectrum is essentially identical to that of the atmospheric neutrons. One hour in the WNR beam is equivalent to 2-3 E5 hours (beam intensity varies from year to year) at 40,000 ft, or alternatively, 6-9 E7 hours at ground level (the neutron flux at ground level is taken as 1/300 of that at 40,000 ft).

Table 1. Ground Level Soft Error Rates Measured by RAM Vendors

Vendor  RAM size  Type/Test†  Device types  FIT rate  FIT range    Upset/bit-hr
IBM     1M        D/A         2*            3300      2500-4100    3.3E-12
IBM     1M        D/F         2*            325       230-420      3.1E-13
Mot     256K      S/F         3             500       450-560‡     2E-12
Mot     1M        S/F         2             2070      1330-2800‡   2.1E-12
Mot     4M        S/F         4             5750      4500-8900‡   1.5E-12

† D = DRAM, S = SRAM; A indicates accelerated testing using proton beam, F indicates field testing (1000 parts, 1000 hrs)
* In this case each device type was from a different vendor
‡ Each of these individually measured FIT rates had an uncertainty of about a factor of 4 (2-0.5) based on the small number of upsets and the probabilistic treatment of its confidence level

Table 3 contains the WNR SEU cross section measurements on three DRAMs and six SRAMs, one of which has been previously published [22]. The WNR SEU response of the six SRAMs on a per bit basis shows a fairly wide variation.
However, when the Cypress parts, the only RAMs that exhibited multiple bit upset (a few percent of the single errors), are removed, the variation narrows significantly. It narrows even further if the only 4M SRAM, the MCM6246, which is notably less sensitive, is also removed. Among the three 4M DRAMs, there is also some variation, with the Oki part being notably less sensitive than the other two. None of the nine RAMs exhibited neutron-induced single event latchup, as was to be expected [20]. Column 4 of Table 3 contains the WNR SEU cross section (upsets/fluence > 10 MeV), column 5 the scaled SEU rate at ground level (based on a flux of 19.3 n/cmZ-hr on the ground) and column 6 the ground level SEU rate calculated using the burst generation rate (BGR) method [20]. The scaled neutron-induced SEU rates are in the same range of 0.5-2 E12 upset/bit-h as those actually measured on the ground as tabulated in Table 2. nusin making the comparison between the measured bit error rates from computer error logs, field SER data, etc., summarized in Table 2, these error I I zyxwvutzyxwvutsrqponml srqponmlkjihgfedcbaZYXWVUTSRQPONMLKJI H G zyxwvu zyxwv zyxwvu 2146 rates directly correlate with the neutron-induced upset rate tabulated in Table 3. A direct of the field upset rates and the rates scaled from shown in Table 4. reliability standards on microelectronics to encourage the development and use of low FIT-rate chips, and d) utilizing the appropriate and available accelerated SER techniques/tests to measure ground level FIT rates. As indicated, use of the WNR beam to measure RAM SEU rates is one of several accelerated SER methods, probably the best one because this neutron beam is so similar to the actual ospheric neutron spectrum. 
The IBM method uses a beam of 1.50 MeV protons to simulate the atmospheric neutrons [18], and they apply an empirically derived factor 17 to convert the measured SEU the ground level SEU rate (factor varies with s very similar to the use of the Table 3, in which the conversion ourly neutron flux at ground level > 10 MeV, 19.3 n/cm2-hr, that converts the WNR SEU cross 1 SEU rate. Because of the limited [20], we use another method which cross section data via the BGR as an efficient alternative to using effectiveness of this approach has been 21 for a few RAMS. By comparing columns 4 and 5 of Table 3 we provide further evidence of the effectiveness of the method. Nevertheless, this BGR augmented by measuring the SEU cross section trons to normalize the BGR parameters, m to the atmospheric neutron spectrum. The diversity of applications in item b) is extremely broad. Biomedical devices tend to be expensive, but due to the urgency of health considerations, additional costs for EDAC or SEU-immunechips can be readily absorbed and passed on. Industrial products might focus on process control applications for which some additional costs might also be warranted to protect against RAM errors. In contrast, commercial products tend to be highly cost competitive, and so the extra costs of error mitigation techniques might hardest to justify. However, in some instances, such as those related to financial transactions and “smart” cards, or the use of microelectronics-based automobile systems, the vital importance of dealing with such ground level errors, which are to be expected if no mitigation techniques are used, may be much more apparent. Each product may use << 1Mbit of RAM,but because millions of units expected to be sold, the total number of bit-hours in operation may still be large. zyxwvutsr V. CONCLUSIONS Thousands of single event upsets are occurring every year on the ground, yet few in the SEU community are aware of &em. 
These upsets have been recorded mainly in large computer systems equipped with EDAC to detect, correct and log in Having demonstrated that the atmospheric neutrons are these errors. We have examined a few such error logs from primarily responsible for the ground level upsets, there are a large computers, as well as other sources of ground level number of impacts that this cause-effectrelationship has that upset data. All of this data is consistent with the atmospheric extend beyond the SEU community. Some of these impacts neutrons being the main cause of the upsets. It is also the are summarized in Table 5 and include: a) improving the same conclusion reached years ago by the IBM team that reliability of large computer systems, b) applying error investigated this topic privately [4]. We demonstrated the mitigation techniques to RAMS used in biomedical, correlation by comparison with the neutron-induced SEU rate commercial and industrial products, c) imposing realistic System and Location Basis for Ground Rate zyxwvutsr zy 2747 zy Sections in WNR Beam RAM Meas'd WNR Gr'nd level SEU Calculated Gr'nd Weibull Fit Heavy Ion Size/Type* SEU X-section, Rate, Upbit-hr, SEU Rate, Upbit- Parameters Used in BGR cm2/bit WNR-Scaled hr, BGRMethod Calculation % 4M/D 1.2E-13 2.3E- 12 2.1E-12 4.7E- 7,0.85, 18.3, 1.13 ## 4M/D 2.2E- 14 4.3E-13 NIA 4 m 9.3E-14 1.8E-12 2.3E-12 ISS RAM Vendor TC5 14400-80 Toshiba MSM514400-80 Oki TMS44100[22] TI IDT7 1256 HM65656 MCM6206 MCM6246 CY7C195fi CY7C1997 IDT 256K/S Matra 256K/S Motorola 256K/S Motorola 4WS C Y ~ R S S256K/S Cypress 256K/S 6.5E- 14 1.9E-13 1.4E-13 1.25E-14 5.7E-13 5.2E-13 1.3E-12 3.7E-12 2.7E- 12 2.4E- 13 1.1E-11 1E-11 2.3E-12 1.2E-12 7E-13 3.4E-13 8.4E-12 ' 1.93E- 13 3.72E-12 2.48E- 12 (7unique RAMs) 3.8E-7, 1.98, 11.46,2.24 $ 2.7E-6,2.64,3005,0.636 5.3E-8,1.1.5.45.6.88 I :.98E-6, 1.02,33.7, 1.08 1 zyxwvutsrqp zyxwvutsrzy zyxwvuts Simple Average lfor 9 RAMs 1 % Weibull parameters (see [SI) are in following order: 00 (per bit), 
Lo, W and S; BGR method (see [20]) assumed t=2 pm and C=OS in all cases; Weibull parameters are from following related RAMs: # TC5141OOZ-10, $ HM65656 engineering samples 1231, Q MCM6226, and t CY7C185 fi These parts exhibited multiple bit upsets during the WNR testing. based on measurements with the WNR neutron beam. We have not focused on any one specific DRAM or SRAM,but rather on a representative sampling of RAMs to show that the correlation applies to both SRAMs and DRAMs, and applies fairly well regardless of which commercially available RAM is used (however this is not true for those RAMs specifically designed to have a low SEU sensitivity, e.g. the IBM LUNAC andEi DRAMs [26]). in their nomenclature) is still incorporated, but the correctable errors are no longer logged. The exact reason for eliminating the error logging is unclear (100% confidence in the ECC, increased speed, lower cost, etc.), but it will have an impact . On some of the older workstations that had much smaller memory capacities, the errors were in fact logged, but because of the smaller memory size, the errors occurred much less frequently. Systems administrators familiar with these older workstations can recall seeing the occurrence of single An upset rate in the range of 1-2 E-12 upsethit-hr was shown bit errors. Based on the data we presented, the lGbit to be representative of the rate that most SRAMs and DRAMs workstations should experience 1-2 errors per month, in actual field applications are experiencing, although there depending on how much of the memory is being used on a were a few with lower rates (see Table 2). The upset rate of daily basis. 
1-2 E-12 upset/bit-hr leads to FIT rates of 1000-2000 FITs for a 1 Mbit RAM, which is just at the limit of 2000 FITs for soft errors given in the STACK specification for integrated circuits [25]. Thus we would expect that most RAMs of larger memory capacity than 1M (e.g., 4M, 16M, etc.) would not meet the STACK limit in actual field applications. RAM tests using an alpha source may yield a rate lower than this limit, but this study, and that by Lage [17], show that this is an erroneous test. The atmospheric neutrons are the cause of most of the upsets on the ground, and the alpha particles do not simulate the neutron interactions with the RAMs; they only simulate alphas emitted from the chip package.

It should be noted that gaining access to error logs may not always be very easy. There is the case of one supercomputer manufacturer who, through a very stringent purchase agreement, precludes any owner of the supercomputer from divulging error information about that computer system. In the case of workstations, which today have on the order of 1 Gbit or more of DRAM, EDAC (ECC, error correcting code,

Table 4. Direct Comparison of RAM SEU Rates at Ground Level, From Field Measurements and Scaled from Measured SEU Cross Sections in WNR Neutron Beam
[Table body not recovered; the only surviving entry reads "Network Fermilab, Batavia".]

Improve reliability of biomedical, industrial and commercial products: Reduce RAM sensitivity through techniques known to the SEU community [EDAC, SEU-immune SRAMs, use of other memories less susceptible to SEU (e.g., EEPROMs, flash EEPROMs, etc.)]. Utilize existing expertise and methods to reduce the possibility of RAM upsets at ground level in devices having widespread use (thousands to millions of individual products). In commercial products, use of low SEU-sensitive RAMs is generally precluded because of increased cost. An example is the LUNA-E and C (EDAC) DRAMs developed by IBM to have low SEU rates. However, to be competitive in their THINKPAD laptop computer, IBM uses non-IBM DRAMs [24] because they are cheaper.

Impose realistic [...] microelectronics industry to develop low FIT-rate designs: As RAM devices continue to increase in memory capacity, microelectronics will not meet standards set by its own industry, e.g., 2000 FITs (per device) in STACK Spec 12.1 [25]. They meet it now because the same standard provides for only an alpha source test, and alphas are not the real cause of the errors. Once atmospheric neutrons are recognized as the main cause of errors, they will not be able to meet the maximum allowed FIT rate for RAMs > 1 Mbit.

Utilize available accelerated SER techniques/tests: Effective SEU testing techniques can be applied to RAMs to quickly determine their ground level sensitivity to upset (FIT rates). These test techniques are a much better and quicker way to provide feedback on the susceptibility of specific new RAM design features than the existing [...]

However, since memory requirements have expanded so dramatically over the last few years and are still continuing to do so, the number of errors is likely to continue to increase at a rapid rate. However, without the error logs, there will be no way to track this expected trend in increasing errors.

It has been suggested that it is the thermal neutron portion (E ~ 0.025 eV) of the atmospheric neutron spectrum, rather than the high energy portion (E > 10 MeV), which is mainly responsible for the upsets [27]. In this case the mechanism is that of the thermal neutrons interacting with the B10 fraction of the boron in the borophosphosilicate glass (BPSG) within the glassivation layer over the die, which produces alpha particles. The energy deposition by the alphas leads to the upsets [27]. A very similar mechanism was investigated earlier with respect to the B10 content of boron dopants in microelectronics [28]. That analysis found that both the 1.5 MeV alpha and the 0.8 MeV Li recoil produced by thermal neutron interactions with B10 can deposit energy leading to upsets [28]. In that case, even for the most sensitive RAM tested with thermal neutrons, the upset cross section was about three orders of magnitude smaller than that from the WNR beam (Table 3). Furthermore, ground thermal neutron fluxes are greatly influenced by the effects of topography, soil water content and surrounding man-made materials [29]. For a very simple air/material geometry, the thermal neutron flux at the interface varies by a factor of 5 depending on the material [29]. This implies that large variations in the thermal flux are possible just due to the material/geometry configuration surrounding a particular computer. In contrast, the measured ground level upset rates in Table 2 show much less variation. Thus for a number of reasons, including complete uncertainty of the BPSG content of commercial SRAMs and DRAMs, large variation of the ground level thermal neutron flux from location to location, and old measurements showing a much lower upset cross section, we believe that the contribution of thermal neutrons to the ground level upset rate is small.

It has also been suggested that other cosmic ray secondary particles, protons and pions, may be responsible for the ground level upset rates [30]. These particles may contribute to some portion of the ground level upset rate, but the correlation above, between the measured ground level bit error rate (from error logs, RAM SSER FIT rates, etc.) and the WNR SEU rate measurements, indicates that the atmospheric neutrons are the dominant cause. We expect that additional examinations of other sources of ground level errors will further verify this contention.
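
As a rough check of the FIT-rate arithmetic quoted earlier (1-2 E-12 upset/bit-hr giving 1000-2000 FITs for a 1 Mbit RAM, against the 2000 FIT limit cited from the STACK specification [25]), the conversion can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes the usual definition of one FIT as one failure per 10^9 device-hours, it takes 1 Mbit as 10^6 bits so that the round numbers in the text are reproduced, and the function and constant names are hypothetical rather than taken from the paper.

```python
# Sketch: convert a per-bit upset rate into a per-device FIT rate.
# Assumption: 1 FIT = 1 failure per 1e9 device-hours.
# Assumption: 1 Mbit is taken as 1e6 bits to match the rounded figures in the text.

DEVICE_HOURS_PER_FIT = 1e9
MBIT = 10**6
STACK_LIMIT_FIT = 2000.0  # soft-error limit quoted in the text from [25]


def fit_rate(upsets_per_bit_hour: float, bits: int) -> float:
    """Return the device FIT rate for a given per-bit upset rate and capacity."""
    return upsets_per_bit_hour * bits * DEVICE_HOURS_PER_FIT


def exceeds_limit(fit: float, limit: float = STACK_LIMIT_FIT) -> bool:
    """True if the device FIT rate is above the quoted limit."""
    return fit > limit


if __name__ == "__main__":
    # Capacities mentioned in the text: 1M, 4M and 16M bit devices.
    for capacity_mbit in (1, 4, 16):
        for rate in (1e-12, 2e-12):
            fit = fit_rate(rate, capacity_mbit * MBIT)
            print(f"{capacity_mbit:>2} Mbit at {rate:.0e} upset/bit-hr: "
                  f"{fit:8.0f} FIT, exceeds limit: {exceeds_limit(fit)}")
```

Under these assumptions the 1 Mbit case spans roughly 1000 to 2000 FITs, while the 4 Mbit and 16 Mbit cases scale linearly with capacity and fall above the quoted limit even at the lower per-bit rate, which is the point made in the text about RAMs larger than 1 Mbit.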
Further studies of ground level error sources might show the effects of latitude and altitude on ground level rates, e.g., similar to the variation of the atmospheric neutron flux with latitude and altitude [8], and of variations in the SEU response of different RAMs, such as that seen in Table 3. Such examinations will hopefully also lead to expanded cooperation between the SEU community and the designers of microelectronics, computer systems and the diversity of commercial electronic products that use significant quantities of RAM, in terms of accounting for the effects of SEU in those products.

ACKNOWLEDGMENT

The assistance provided by the following people is gratefully acknowledged with respect to information and hardware they furnished and useful discussions they participated in: S. R. Allen, W. M. Kearns, T. A. Krogel, P. P. Majewski, D. L. Oberg, S. W. Snow and J. L. Wert of the Boeing Defense & Space Group, S. A. Wender of LANL, G. Eddy of Cray Research Inc., T. Corbiere of Matra MHS, J. F. Ziegler of IBM and M. Fischler of Fermilab.

REFERENCES

1. J. F. Ziegler and W. A. Lanford, "Effect of Cosmic Rays on Computer Memories", Science, 206, 776 (1979)
2. T. C. May and M. H. Woods, "A New Physical Mechanism for Soft Errors in Dynamic Memories", Proceedings 16th Int'l Reliability Physics Symposium, p. 33, April, 1978
3. A. Hasnain and A. Ditali, "Building-In Reliability: Soft Errors - A Case Study", Proceedings 30th Int'l Reliability Physics Symposium, p. 276, April, 1992
4. J. F. Ziegler et al, "IBM Experiments in Soft Fails in Computer Electronics (1979-1984)", IBM J. Res. Develop., 40, 3 (1996)
5. R. Silberberg, C. H. Tsao and J. R. Letaw, "Neutron Generated Single Event Upset in the Atmosphere", IEEE Trans. Nucl. Sci., NS-31, 1066 and 1183, Dec. 1984
6. A. Taber and E. Normand, "Investigation and Characterization of SEU Effects and Hardening Strategies in Avionics", IBM Report 92-L75-020-2, August, 1992, republished as DNA-Report DNA-TR-94-123, DNA, Feb, 1995
7. O. C. Allkofer and P. K. Grieder, Physics Data: Cosmic Rays on Earth, Fachinformationszentrum Energie, Physik, Mathematik GmbH, Karlsruhe, 1984
8. E. Normand and T. J. Baker, "Altitude and Latitude Variations in Avionics SEU and Atmospheric Neutron Flux", IEEE Trans. Nucl. Sci., 40, 1484 (1993)
9. T. Nakamura et al, "Altitude Variation of Cosmic Ray Neutrons", Health Physics, 53, 509 (1987)
10. J. Hewitt et al, "Ames Collaborative Study of Cosmic Ray Neutrons: Mid-Latitude Flights", Health Physics, 34, 375 (1978)
11. T. O'Gorman, "An Experiment to Determine the Effect of Cosmic Rays on a FET Computer Memory", paper presented at the Fourth Single Event Effects Symposium, Los Angeles, 1985
12. T. O'Gorman, "The Effect of Cosmic Rays on the Soft Error Rate of a DRAM at Ground Level", IEEE Trans. Electron Devices, 41, 553 (1994)
13. T. J. O'Gorman et al, "Field Testing for Cosmic Ray Soft Errors in Semiconductor Memories", IBM J. Res. Develop., 40, 41 (1996)
14. M. Fischler, personal communication
15. F. Gardic et al, "Analysis of Local and Global Transient Effects in a CMOS SRAM", IEEE Trans. Nucl. Sci., 43, 899 (1996)
16. "Package Induced Soft-Error Test Procedure", MIL STD 883D, Method 1032.1
17. C. Lage et al, "Soft Error Rate and Stored Charge Requirements in Advanced High Density SRAMs", IEDM Tech. Digest, 821 (1993)
18. J. F. Ziegler et al, "Accelerated Testing for Cosmic Soft-Error Rate", IBM J. Res. Develop., 40, 51 (1996)
19. W. R. McKee et al, "Cosmic Ray Neutron Induced Upsets as a Major Contributor to the Soft Error Rate of Current and Future Generation of DRAMs", paper presented at 1996 Proceedings of International Reliability Physics Symposium, April, 1996
20. E. Normand, "Single-Event Effects in Avionics", IEEE Trans. Nucl. Sci., 43, 461, 1996
21. E. Normand, D. L. Oberg, J. L. Wert, T. J. Baker and C. M. Castaneda, "Considerations in Single Event Upset Testing with Energetic Neutrons", paper presented at the Eighth Single Event Effects Symposium, Los Angeles, April, 1992
22. E. Normand, D. L. Oberg, J. L. Wert, J. D. Ness, P. P. Majewski, S. A. Wender and A. Gavron, "Single Event Upset and Charge Collection Measurements Using High Energy Neutrons and Protons", IEEE Trans. Nucl. Sci., 41, 2203, 1994
23. R. Ecoffet, M. LaBrunee, S. Duzellier and D. Falguere, "Heavy Ion Test Results on Memories", 1992 IEEE Radiation Effects Data Workshop, p. 27
24. M. Martignano and R. Harboe-Sorensen, "IBM THINKPAD Radiation Testing and Recovery During Euromir Missions", IEEE Trans. Nucl. Sci., 42, 2004, 1995
25. "General Requirements for Integrated Circuits", Specification 0001, Issue 12.1, STACK International, Milton Keynes, UK, Sept. 1993
26. P. Calvel, P. Lamothe, C. Barillot, R. Ecoffet, S. Duzellier and E. G. Stassinopoulos, "Space Radiation Evaluation of 16 Mbit DRAMs for Mass Memory Applications", IEEE Trans. Nucl. Sci., 41, 2267, 1994
27. R. Baumann, T. Hossain, S. Murata and H. Kitagawa, "Boron Compounds as a Dominant Source of Alpha Particles in Semiconductor Devices", Proceedings 1995 Int'l Reliability Physics Symposium, p. 297, April, 1995
28. T. R. Oldham, S. Murrill, and C. T. Self, "Single Event Upset of VLSI Memory Circuits Induced by Thermal Neutrons", Radiation Effects, Research and Engineering, Vol. 5, No. 1, p. 6, 1986
29. K. O'Brien, H. Sandmeier, G. E. Hansen and J. E. Campbell, "Cosmic Ray Induced Neutron Background Sources and Fluxes for Geometries of Air Over Water, Ground, Iron and Aluminum", J. Geophys. Res., 83, 114 (1978)
30. J. F. DiCello et al, "An Estimate of Error Rates in Integrated Circuits at Aircraft Altitudes and at Sea Level", Nucl. Inst. and Methods, B40/41, 1295 (1989)
31. J. R. Letaw and E. Normand, "Guidelines for Predicting Single Event Upsets in Neutron Environments", IEEE Trans. Nucl. Sci., NS-38, 1500, 1991