Processor Design for Soft Errors: Challenges and State of the Art

Published: 08 November 2016

Abstract

Today, soft errors are one of the major design technology challenges at and beyond the 22nm technology node. This article introduces the soft error problem from the perspective of processor design and surveys existing soft error mitigation methods across the levels of design abstraction involved in processor design: the device level, the circuit level, the architectural level, and the program level.

References

[1]
R. E. Ahmed, R. C. Frazier, and P. N. Marinos. 1990. Cache-aided rollback error recovery (CARER) algorithm for shared-memory multiprocessor systems. In Proceedings of the 20th International Symposium on Fault-Tolerant Computing, 1990 (FTCS-20’90), Digest of Papers. 82--88.
[2]
J. H. Ahn, N. P. Jouppi, C. Kozyrakis, J. Leverich, and R. S. Schreiber. 2009. Future scaling of processor-memory interfaces. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC’09). ACM, New York, NY. Article 42, 12 pages.
[3]
G. M. Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18--20, 1967, Spring Joint Computer Conference (AFIPS’67 (Spring)). ACM, New York, NY. 483--485.
[4]
H. Ando, Y. Yoshida, A. Inoue, I. Sugiyama, T. Asakawa, K. Morita, T. Muta, T. Motokurumada, S. Okada, H. Yamashita, Y. Satsukawa, A. Konmoto, R. Yamashita, and H. Sugiyama. 2003b. A 1.3 GHz fifth generation SPARC64 microprocessor. In Proceedings of the IEEE International Solid-State Circuits Conference, 2003, Digest of Technical Papers. (ISSCC’03). Vol. 1. 246--491.
[5]
H. Ando, Y. Yoshida, A. Inoue, I. Sugiyama, T. Asakawa, K. Morita, T. Muta, T. Motokurumada, S. Okada, H. Yamashita, Y. Satsukawa, A. Konmoto, R. Yamashita, and H. Sugiyama. 2003a. A 1.3GHz fifth generation SPARC64 microprocessor. In Proceedings of the Design Automation Conference, 2003. 702--705.
[6]
T. M. Austin. 1999. DIVA: A reliable substrate for deep submicron microarchitecture design. In Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture (MICRO 32’99). IEEE Computer Society, 196--207.
[7]
A. Avižienis and L. Chen. 1977. On the implementation of N-version programming for software fault tolerance during execution. In Proceedings of the IEEE International Computer Software and Applications Conference. 149--155.
[8]
A. Avizienis. 1971. Arithmetic error codes: Cost and effectiveness studies for application in digital system design. IEEE Trans. Comput. 20, 11 (1971), 1322--1331.
[9]
A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr. 2004. Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. Dependable Secur. Comput. 1, 1 (Jan. 2004), 11--33.
[10]
B. J. Babb. 1998. Error detection and correction circuit. (May 1998). Patent No. 5751744, Filed Jan. 27, 1995, Issued May 12, 1998.
[11]
A. Bachtold, P. Hadley, T. Nakanishi, and C. Dekker. 2001. Logic circuits with carbon nanotube transistors. Science 294, 5545 (2001), 1317--1320.
[12]
R. D. Bannon and M. M. Bhansali. 1984. Digital data storage error detecting and correcting system and method. (April 1984). Patent No. 0042966, Filed May 19, 1981, Issued Apr. 18, 1984.
[13]
W. B. Barker. 1977. Error-checking scheme. (July 1977). Patent No. 4035766, Filed Aug. 1, 1975, Issued July 12, 1977.
[14]
M. J. Barry. 1992. Radiation resistant sram memory cell. (Oct. 1992). Patent No. 5157625, Filed May 22, 1990, Issued Oct. 20, 1992.
[15]
J. F. Bartlett. 1981. A nonstop kernel. SIGOPS Oper. Syst. Rev. 15, 5 (1981), 22--29.
[16]
W. Bartlett and L. Spainhower. 2004. Commercial fault tolerance: A tale of two systems. IEEE Trans. Depend. Secure Comput. 1, 1 (Jan. 2004), 87--96.
[17]
R. Baumann. 2005. Soft errors in advanced computer systems. IEEE Des. Test 22, 3 (May 2005), 258--266.
[18]
D. Binder, E. C. Smith, and A. B. Holman. 1975. Satellite anomalies from Galactic cosmic rays. IEEE Trans. Nucl. Sci. 22, 6 (1975), 2675--2680.
[19]
A. Biswas, P. Racunas, R. Cheveresan, J. Emer, S. S. Mukherjee, and R. Rangan. 2005. Computing architectural vulnerability factors for address-based structures. In Proceedings of the 32nd Annual International Symposium on Computer Architecture (ISCA’05). IEEE Computer Society. 532--543.
[20]
D. Blaauw, S. Kalaiselvan, K. Lai, Wei-Hsiang Ma, S. Pant, C. Tokunaga, S. Das, and D. Bull. 2008. Razor II: In situ error detection and correction for PVT and SER tolerance. In Proceedings of the IEEE International Solid-State Circuits Conference, 2008 (ISSCC’08), Digest of Technical Papers. 400--622.
[21]
T. G. W. Blake and T. W. Houston. 1993. Memory cell with capacitance for single event upset protection. (April 1993). Patent No. 5204990, Filed Sept. 7, 1988, Issued Apr. 20, 1993.
[22]
W. Bland. 2012. User Level Failure Mitigation in MPI. In Proceedings of the Euro-Par 2012: Parallel Processing Workshops: BDMC, CGWS, HeteroPar, HiBB, OMHI, Paraphrase, PROPER, Resilience, UCHPC, VHPC. Revised Selected Papers. Springer, Berlin. 499--504.
[23]
J. Blome, S. Mahlke, D. Bradley, and K. Flautner. 2005. A microarchitectural analysis of soft error propagation in a production-level embedded microprocessor. In Proceedings of the 1st Workshop on Architecture Reliability.
[24]
D. M. Blough, F. J. Kurdahi, and S. Y. Ohm. 1999. High-level synthesis of recoverable VLSI microarchitectures. IEEE Trans. Very Large Scale Integr. Syst. 7, 4 (Dec. 1999), 401--410.
[25]
K. K. Bourdelle, S. Chaudhry, and J. Chu. 2002. The effect of triple well implant dose on performance of NMOS transistors. IEEE Trans. Electron Devices 49, 3 (March 2002), 521--524.
[26]
K. A. Bowman, J. W. Tschanz, S. L. L. Lu, P. A. Aseron, M. M. Khellah, A. Raychowdhury, B. M. Geuskens, C. Tokunaga, C. B. Wilkerson, T. Karnik, and V. K. De. 2011. A 45 nm resilient microprocessor core for dynamic variation tolerance. IEEE J. Solid-State Circuits 46, 1 (Jan. 2011), 194--208.
[27]
G. Bronevetsky, D. Marques, K. Pingali, P. Szwed, and M. Schulz. 2004. Application-level checkpointing for shared memory programs. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XI’04). ACM, New York, NY. 235--247.
[28]
D. T. Brown. 1960. Error detecting and correcting binary codes for arithmetic operations. IRE Trans. Electron. Comput. EC-9, 3 (Sept. 1960), 333--337.
[29]
M. Bruel. 1995. Silicon on insulator material technology. Electron. Lett. 31, 14 (1995), 1201--1202.
[30]
D. Burnett, C. Lage, and A. Bormann. 1993. Soft-error-rate improvement in advanced BiCMOS SRAMs. In Proceedings of the 31st Annual Proceedings of the International Reliability Physics Symposium, 1993. 156--160.
[31]
J. A. Cairns and J. F. Ziegler. 1985. Coated ceramic substrates for mounting integrated circuits. (July 1985). Patent No. 4528212, Filed July 22, 1982, Issued July 9, 1985.
[32]
J. A. Cairns and J. F. Ziegler. 1989. Coated ceramic substrates for mounting integrated circuits and methods of coating such substrates. (March 1989). Patent No. 0099570, Filed July 19, 1983, Issued Mar. 29, 1989.
[33]
T. Calin, M. Nicolaidis, and R. Velazco. 1996. Upset hardened memory design for submicron CMOS technology. IEEE Trans. Nucl. Sci. 43, 6 (1996), 2874--2878.
[34]
E. H. Cannon, D. D. Reinhardt, M. S. Gordon, and P. S. Makowenskyj. 2004. SRAM SER in 90, 130 and 180 nm bulk and SOI technologies. In Proceedings of the 42nd Annual IEEE International Reliability Physics Symposium Proceedings, 2004. 300--304.
[35]
P. M. Carter and B. R. Wilkins. 1987. Influences on soft error rates in static RAMs. IEEE J. Solid-State Circuits 22, 3 (June 1987), 430--436.
[36]
J. Chang, G. A. Reis, and D. I. August. 2006. Automatic instruction-level software-only recovery. In Proceedings of the International Conference on Dependable Systems and Networks (DSN’06). 83--92.
[37]
C.-L. Chen. 1989. Double error correction - triple error detection code for a memory. (Aug. 1989). Patent No. 0107038, Filed Sept. 20, 1983, Issued Aug. 23, 1989.
[38]
L. Chen and A. Avižienis. 1995. N-version programming: A fault-tolerance approach to rellablllty of software operatlon. In Proceedings of the 25th International Symposium on Fault-Tolerant Computing, 1995, Highlights from 25 Years. 113--119.
[39]
P. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson. 1994. RAID: High-performance, reliable secondary storage. ACM Comput. Surv. 26, 2 (June 1994), 145--185.
[40]
E. Cheng, S. Mirkhani, L. G. Szafaryn, C.-Y. Cher, H. Cho, K. Skadron, M. R. Stan, K. Lilja, J. A. Abraham, P. Bose, and S. Mitra. 2016a. CLEAR: Cross-layer exploration for architecting resilience - combining hardware and software techniques to tolerate soft errors in processor cores. CoRR abs/1604.03062 (2016). http://arxiv.org/abs/1604.03062
[41]
E. Cheng, S. Mirkhani, L. G. Szafaryn, C.-Y. Cher, H. Cho, K. Skadron, M. R. Stan, K. Lilja, J. A. Abraham, P. Bose, and S. Mitra. 2016b. CLEAR: Cross-layer exploration for architecting resilience - combining hardware and software techniques to tolerate soft errors in processor cores. In Proceedings of the 53rd Annual Design Automation Conference (DAC’16). ACM, New York, NY. Article 68, 6 pages.
[42]
A. L. Crouch, M. D. Pressly, J. C. Circello, and R. Duerden. 1997. Serial scan chain architecture for a data processing system and method of operation. (July 1997). Patent No. 5592493, Filed Sept. 13, 1994, Issued July 1, 1997.
[43]
M. de Kruijf, S. Nomura, and K. Sankaralingam. 2010. Relax: An architectural framework for software recovery of hardware faults. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA’10). ACM, New York, NY. 497--508.
[44]
D. A. G. de Oliveira, L. L. Pilla, T. Santini, and P. Rech. 2016. Evaluation and mitigation of radiation-induced soft errors in graphics processing units. IEEE Trans. Comput. 65, 3 (March 2016), 791--804.
[45]
T. Dell. 1997. A white paper on the benefits of Chipkill-correct ECC for PC server main memory. In IBM Microelectronics Division.
[46]
S. E. Diehl, A. Ochoa, Jr., P. V. Dressendorfer, P. Koga, and W. A. Kolasinski. 1982. Error analysis and prevention of cosmic ion-induced soft errors in static CMOS RAMs. IEEE Trans. Nucl. Sci. 29 (Dec. 1982), 2032--2039.
[47]
A. Dixit and A. Wood. 2011. The impact of new technology on soft error rates. In 2011 IEEE International Reliability Physics Symposium (IRPS’11). 5B.4.1--5B.4.7.
[48]
P. E. Dodd and F. W. Sexton. 1995. Critical charge concepts for CMOS SRAMs. IEEE Trans. Nucl. Sci. 42, 6 (Dec. 1995), 1764--1771.
[49]
J. G. Dooley. 1994. Seu-immune latch for gate array, standard cell, and other asic applications. (May 1994). Patent No. 5311070, Filed June 26, 1992, Issued May 10, 1994.
[50]
J. Duell. 2003. The Design and Implementation of Berkeley Lab’s Linux Checkpoint/Restart. Technical Report LBNL-54941. Lawrence Berkeley National Laboratory, Berkeley, CA.
[51]
N. El-Sayed and B. Schroeder. 2013. Reading between the lines of failure logs: Understanding how HPC systems fail. In 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’13). 1--12.
[52]
K. Endo. 2008. Enhancing SRAM cell performance by using independent double-gate FinFET. In 2008 IEEE International Electron Devices Meeting (IEDM'08). 1--4.
[53]
D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, and T. Mudge. 2003. Razor: A low-power pipeline based on circuit-level timing speculation. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 36’03). IEEE Computer Society. 7--. http://dl.acm.org/citation.cfm?id=956417.956571
[54]
Y. P. Fang and A. S. Oates. 2011. Neutron-induced charge collection simulation of bulk FinFET SRAMs compared with conventional planar SRAMs. IEEE Trans. Device Mat. Reliabil. 11, 4 (Dec. 2011), 551--554.
[55]
S. Feng, S. Gupta, A. Ansari, and S. Mahlke. 2010. Shoestring: Probabilistic soft error reliability on the cheap. SIGARCH Comput. Archit. News 38, 1 (March 2010), 385--396.
[56]
S. Feng, S. Gupta, A. Ansari, S. A. Mahlke, and D. I. August. 2011. Encore: Low-cost, fine-grained transient fault recovery. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44’11). ACM, New York, NY. 398--409.
[57]
M. J. Fremont. 1987. Method and apparatus for fault recovery within a computing system. (Oct. 1987). Patent No. 4703481, Filed Aug. 16, 1985, Issued Oct. 27, 1987.
[58]
X. Fu, T. Li, and J. A. B. Fortes. 2006. Sim-SODA: A unified framework for architectural level software reliability analysis. In Proceedings of the Workshop on Modeling, Benchmarking and Simulation.
[59]
E. Fujiwara and D. K. Pradhan. 1990. Error-control coding in computers. Computer 23, 7 (1990), 63--72.
[60]
J. Furuta, C. Hamanaka, K. Kobayashi, and H. Onodera. 2010. A 65nm bistable cross-coupled dual modular redundancy flip-flop capable of protecting soft errors on the c-element. In Proceedings of the 2010 Symposium on VLSI Circuits. 123--124.
[61]
H. L. Garner. 1966. Error codes for arithmetic operations. IEEE Trans. Electron. Comput. EC-15, 5 (Oct. 1966), 763--770.
[62]
B. Gill, N. Seifert, and V. Zia. 2009. Comparison of alpha-particle and neutron-induced combinational and sequential logic error rates at the 32nm technology node. In Proceedings of the 2009 IEEE International Reliability Physics Symposium. 199--205.
[63]
J. N. Glosli, D. F. Richards, K. J. Caspersen, R. E. Rudd, J. A. Gunnels, and F. H. Streitz. 2007. Extending stability beyond CPU millennium: A micron-scale atomistic simulation of Kelvin-Helmholtz instability. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC’07). ACM, New York, NY. Article 58, 11 pages.
[64]
M. Gomaa, C. Scarbrough, T. N. Vijaykumar, and I. Pomeranz. 2003. Transient-fault recovery for chip multiprocessors. In Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA’03). ACM, New York, NY. 98--109.
[65]
L. B. Gomez, F. Cappello, L. Carro, N. DeBardeleben, B. Fang, S. Gurumurthi, K. Pattabiraman, P. Rech, and M. S. Reorda. 2014. GPGPUs: How to combine high computational power with high reliability. In Proceedings of the 2014 Design, Automation Test in Europe Conference Exhibition (DATE’14). 1--9.
[66]
W. Gu, Z. Kalbarczyk, Ravishankar, K. Iyer, and Z. Yang. 2003. Characterization of Linux kernel behavior under errors. In Proceedings of the International Conference on Dependable Systems and Networks, 2003. 459--468.
[67]
M. S. Gupta, J. A. Rivers, P. Bose, G. Y. Wei, and D. Brooks. 2009. Tribeca: Design for PVT variations with local recovery and fine-grained adaptation. In 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’09). 435--446.
[68]
R. W. Hamming. 1950. Error detecting and error correcting codes. Bell Syst. Tech. J. 26, 2 (1950), 147--160.
[69]
J.-J. Han, D.-H. Hwang, B.-K. Kim, and B.-K. Lee. 2001. Semiconductor device having triple-well. (May 2001). Patent No. 6225199, Filed July 7, 1999, Issued May 1, 2001.
[70]
S. K. S. Hari, S. V. Adve, H. Naeimi, and P. Ramachandran. 2012. Relyzer: Exploiting application-level fault equivalence to analyze application resiliency to transient faults. In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XVII’12). ACM, New York, NY. 123--134.
[71]
K. J. Hass and J. W. Ambles. 1999. Single event transients in deep submicron CMOS. In 42nd Midwest Symposium on Circuits and Systems, 1999. Vol. 1. 122--125.
[72]
J. D. Hayden. 1994. A quadruple well, quadruple polysilicon BiCMOS process for fast 16 Mb SRAM’s. IEEE Trans. Electron Devices 41, 12 (1994), 2318--2325.
[73]
P. Hazucha and C. Svensson. 2000. Impact of CMOS technology scaling on the atmospheric neutron soft error rate. IEEE Trans. Nucl. Sci. 47 (Dec. 2000), 2586--2594.
[74]
J. L. Hennessy and D. A. Patterson. 2006. Computer Architecture, Fourth Edition: A Quantitative Approach. Morgan Kaufmann Publishers, San Francisco, CA.
[75]
R. Ho, K. W. Mai, and M. A. Horowitz. 2001. The future of wires. Proc. IEEE 89, 4 (April 2001), 490--504.
[76]
T. W. Houston. 1990. Memory cell with improved single event upset rate reduction circuitry. (Sept. 1990). Patent No. 4956814, Filed Sept. 30, 1988, Issued Sept. 11, 1990.
[77]
A. W. Hsingya, J.-C. Young, and S. K. Ming. 2013. Triple well flash memory cell and fabrication process. (Feb. 2013). Patent No. 0810667, Filed May 30, 1997, Issued Feb. 27, 2013.
[78]
K.-H. Huang and J. A. Abraham. 1984. Algorithm-based fault tolerance for matrix operations. IEEE Trans. Comput. 33, 6 (June 1984), 518--528.
[79]
X. Huang. 1999. Sub 50-nm FinFET: PMOS. In IEDM. 67--70.
[80]
D. Hunt and P. Marinos. 1987. A general purpose cache-aided rollback error recovery (CARER) technique. In Proceedings of the 17th International Symposium on Fault-Tolerant Computing Systems. 170--175.
[81]
E. Ibe, H. Taniguchi, Y. Yahagi, K. Shimbo, and T. Toba. 2010. Impact of scaling on neutron-induced soft error in SRAMs from a 250 nm to a 22 nm design rule. IEEE Trans. Electron Devices 57, 7 (2010), 1527--1538.
[82]
K. Itoh. 1980. A single 5V 64K dynamic RAM. In ISSCC. Vol. XXIII. 228--229.
[83]
R. K. Iyer, N. M. Nakka, Z. T. Kalbarczyk, and S. Mitra. 2005. Recent advances and new avenues in hardware-level reliability support. IEEE Micro 25, 6 (Nov. 2005), 18--29.
[84]
K. Johansson, M. Ohlsson, N. Olsson, J. Blomgren, and P. U. Renberg. 1999. Neutron induced single-word multiple-bit upset in SRAM. IEEE Trans. Nucl. Sci. 46, 6 (Dec. 1999), 1427--1433.
[85]
J.-Y. Jou and J. A. Abraham. 1988. Fault-tolerant FFT networks. IEEE Trans. Comput. 37, 5 (May 1988), 548--561.
[86]
A. B. Kahng. 2013. The ITRS design technology and system drivers roadmap: Process and status. In Proceedings of the 50th Annual Design Automation Conference (DAC’13). ACM, New York, NY. Article 34, 6 pages.
[87]
A. Kar-Roy, M. Racanelli, and J. Zhang. 2006. Deep N wells in triple well structures and method for fabricating same. (May 2006). Patent No. 7052966, Filed Apr. 9, 2003, Issued May 30, 2006.
[88]
T. Karnik. 2002. Selective node engineering for chip-level soft error rate improvement {in CMOS}. In VLSIC. 204--205.
[89]
L. Hsiao-Heng Kelin, L. Klas, B. Mounaim, R. Prasanthi, I. R. Linscott, U. S. Inan, and M. Subhasish. 2010. LEAP: Layout design through error-aware transistor positioning for soft-error resilient sequential cell design. In Proceedings of the 2010 IEEE International Reliability Physics Symposium (IRPS’10). 203--212.
[90]
G. H. Kemmetmueller. 1980. RAM error correction using two dimensional parity checking. (Jan. 1980). Patent No. 4183463, Filed July 31, 1978, Issued Jan. 15, 1980.
[91]
D. S. Khudia and S. Mahlke. 2014. Harnessing soft computations for low-budget fault tolerance. In Proceedings of the 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. 319--330.
[92]
D. R. Kim. 1977. Longitudinal parity generator for use with a memory. (April 1977). Patent No. 4016409, Filed Mar. 1, 1976, Issued Apr. 5, 1977.
[93]
K. S. Kim and L. J. Schultz. 1996. Method and apparatus for multi-frequency, multi-phase scan chain (April 1996). Patent No. 5504756, Filed Sept. 30, 1993, Issued Apr. 2, 1996.
[94]
T.-J. King. 2005. FinFETs for nanoscale CMOS digital integrated circuits. In Proceedings of the 2005 IEEE/ACM International Conference on Computer-aided Design (ICCAD’05). IEEE Computer Society. 207--210. http://dl.acm.org/citation.cfm?id=1129601.1129631
[95]
J. S. Klecka, W. F. Bruckert, and R. L. Jardine. 2002. Error self-checking and recovery using lock-step processor pair architecture. (May 2002). Patent No. 6393582, Filed Dec. 10, 1998, Issued May 21, 2002.
[96]
P. M. Kogge, K. T. Truong, D. A. Rickard, and R. L. Schoenike. 1990. Checkpoint retry mechanism. (March 1990). Patent No. 4912707, Filed Aug. 23, 1988, Issued Mar. 27, 1990.
[97]
M. Kohara, Y. Mashiko, K. Nakasaki, and M. Nunoshita. 1990. Mechanism of electromigration in ceramic packages induced by chip-coating polyimide. IEEE Trans. Compon. Hybrids Manufact. Technol. 13, 4 (1990), 873--878.
[98]
I. Laguna, D. F. Richards, T. Gamblin, M. Schulz, and B. R. de Supinski. 2014. Evaluating user-level fault tolerance for MPI applications. In Proceedings of the 21st European MPI Users’ Group Meeting (EuroMPI/ASIA’14). ACM, New York, NY, Article 57, 6 pages.
[99]
H. H. K. Lee, K. Lilja, M. Bounasser, I. Linscott, and U. Inan. 2011. Design framework for soft-error-resilient sequential cells. IEEE Trans. Nucl. Sci. 58, 6 (Dec. 2011), 3026--3032.
[100]
N.-C. Lee. 2000. Lead-free soldering and low alpha solders for Wafer level interconnects. In Proceedings of SMTA International Conference.
[101]
C. C. J. Li and W. K. Fuchs. 1990. CATCH-compiler-assisted techniques for checkpointing. In Proceedings of the 20th International Symposium Fault-Tolerant Computing, 1990 (FTCS-20’90), Digest of Papers. 74--81.
[102]
T. Li, R. Ragel, and S. Parameswaran. 2012. Reli: Hardware/software checkpoint and recovery scheme for embedded processors. In Proceedings of the Design, Automation Test in Europe Conference Exhibition (DATE’12). 875--880.
[103]
T. Li, M. Shafique, J. A. Ambrose, S. Rehman, J. Henkel, and S. Parameswaran. 2013a. RASTER: Runtime adaptive spatial/temporal error resiliency for embedded processors. In Proceedings of the 50th Annual Design Automation Conference (DAC’13). ACM, New York, NY. Article 62, 7 pages.
[104]
T. Li, M. Shafique, S. Rehman, J. A. Ambrose, J. Henkel, and S. Parameswaran. 2013b. DHASER: Dynamic heterogeneous adaptation for soft-error resiliency in ASIP-based multi-core systems. In 2013 IEEE/ACM International Conference on Computer-Aided Design (ICCAD’13). 646--653.
[105]
X. Li, S. V. Adve, P. Bose, and J. A. Rivers. 2005. SoftArch: An architecture-level tool for modeling and analyzing soft errors. In 2005 International Conference on Dependable Systems and Networks (DSN’05). 496--505.
[106]
K. Lilja, M. Bounasser, S. J. Wen, R. Wong, J. Holst, N. Gaspard, S. Jagannathan, D. Loveless, and B. Bhuva. 2013. Single-event performance and layout optimization of flip-flops in a 28-nm bulk technology. IEEE Trans. Nucl. Sci. 60, 4 (Aug. 2013), 2782--2788.
[107]
R. Lim. 2010. Investigation into Lead-Free Solder in Australian Defence Force Applications. Technical Report DSTO-TN-0970. DSTO Defence Science and Technology Organization, Air Vehicles Division.
[108]
D. Lipetz and E. Schwarz. 2011. Self checking in current floating-point units. In 2011 20th IEEE Symposium on Computer Arithmetic (ARITH’11). 73--76.
[109]
M. N. Liu and S. Whitaker. 1992. Low power SEU immune CMOS memory circuits. IEEE Trans. Nucl. Sci. 39 (Dec. 1992), 1679--1684.
[110]
T. D. Loveless, S. Jagannathan, T. Reece, J. Chetia, B. L. Bhuva, M. W. McCurdy, L. W. Massengill, S. J. Wen, R. Wong, and D. Rennie. 2011. Neutron- and proton-induced single event upsets for D- and DICE-flip/flop designs at a 40 nm technology node. IEEE Trans. Nucl. Sci. 58, 3 (June 2011), 1008--1014.
[111]
D. J. C. MacKay. 2002. Information Theory, Inference 8 Learning Algorithms. Cambridge University Press, New York, NY.
[112]
J. Maiz, S. Hareland, K. Zhang, and P. Armstrong. 2003. Characterization of multi-bit soft error events in advanced SRAMs. In IEEE International Electron Devices Meeting, 2003 (IEDM’03), Technical Digest. 21.4.1--21.4.4.
[113]
G. Maki, K. Haas, S. Quan, and J. Murguia. 2003. Conflict free radiation tolerant storage cell. (June 2003). Patent No. 6573773, Filed Feb. 2, 2001, Issued June 3, 2003.
[114]
B. E. Mann, P. J. Trasatti, M. D. Carlozzi, J. A. Ywoskus, and E. J. McGrath. 1999. Loosely coupled mass storage computer cluster. (Jan. 1999). Patent No. 5862312, Filed Oct. 24, 1995, Issued Jan. 19, 1999.
[115]
J. T. Marino Jr. 1981. DES Parity check system. (April 1981). Patent No. 4262358, Filed June 28, 1979, Issued Apr. 14, 1981.
[116]
D. T. Marr, F. Binns, D. L. Hill, G. Hinton, D. A. Koufaty, A. J. Miller, and M. Upton. 2002. Hyper-threading technology architecture and microarchitecture. Intel Technol. J. 6, 1 (Feb. 2002), 4--15.
[117]
C. D. Martino, W. Kramer, Z. Kalbarczyk, and R. Iyer. 2015. Measuring and understanding extreme-scale application resilience: A field study of 5,000,000 HPC application runs. In Proceedings of the 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 25--36.
[118]
J. L. Massey and O. N. García. 1972. Error-correcting codes in computer arithmetic. In Advances in Information Systems Science. Vol. 4. Springer US, Boston, MA. 273--326.
[119]
R. N. Master, S. A. Anand, S. Parthasarathy, and Y. C. Mui. 2007. Lead-free semiconductor package. (May 2007). Patent No. 7215030, Filed June 27, 2005, Issued May 8, 2007.
[120]
T. C. May and M. H. Woods. 1979. Alpha-particle-induced soft errors in dynamic memories. IEEE Trans. Electron Devices 26, 1 (1979), 2--9.
[121]
K. L. McMillan. 1993. Symbolic model checking. In Symbolic Model Checking. Springer US. 25--60.
[122]
A. Meixner, M. E. Bauer, and D. Sorin. 2007. Argus: Low-cost, comprehensive error detection in simple cores. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 40’07). IEEE Computer Society. 210--222.
[123]
A. Meixner and D. J. Sorin. 2007. Error detection using dynamic dataflow verification. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques (PACT’07). IEEE Computer Society. 104--118.
[124]
G. C. Messenger. 1982. Collection of charge on junction nodes from ion tracks. IEEE Trans. Nucl. Sci. 29, 6 (1982), 2024--2031.
[125]
A. Mishra and P. Banerjee. 1999. An algorithm based error detection scheme for the multigrid algorithm. In Proceedings of the 29th Annual International Symposium on Fault-Tolerant Computing, 1999. Digest of Papers. 12--19.
[126]
A. Mishra and P. Banerjee. 2003. An algorithm-based error detection scheme for the multigrid method. IEEE Trans. Comput. 52, 9 (2003), 1089--1099.
[127]
N. Miskov-Zivanov and D. Marculescu. 2006. Circuit reliability analysis using symbolic techniques. IEEE Trans. Computer-Aided Design Integr. Circuits Syst. 25, 12 (Dec. 2006), 2638--2649.
[128]
N. Miskov-Zivanov and D. Marculescu. 2010. Formal modeling and reasoning for reliability analysis. In Proceedings of the 47th Design Automation Conference (DAC’10). ACM, New York, NY. 531--536.
[129]
S. Mitra, T. Karnik, N. Seifert, and M. Zhang. 2005a. Logic soft errors in sub-65Nm technologies design and CAD challenges. In Proceedings of the 42nd Annual Design Automation Conference (DAC’05). ACM, New York, NY. 2--4.
[130]
S. Mitra and E. J. McCluskey. 2000. Which concurrent error detection scheme to choose? In Proceedings of the 2012 IEEE International Test Conference. 985.
[131]
S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim. 2005b. Robust system design with built-in soft-error resilience. Computer 38, 2 (2005), 43--52.
[132]
S. Mukherjee. 2008. Architecture Design for Soft Errors. Morgan Kaufmann Publishers, San Francisco, CA.
[133]
S. S. Mukherjee, J. Emer, and S. K. Reinhardt. 2005. The soft error problem: An architectural perspective. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture. 243--247.
[134]
S. S. Mukherjee, J. S. Emer, S. K. Reinhardt, and C. T. Weaver. 2008. Implementing check instructions in each thread within a redundant multithreading environments. (April 2008). Patent No. 7353365, Filed Sept. 29, 2004, Issued Apr. 1, 2008.
[135]
S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin. 2003a. A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 36’03). IEEE Computer Society. 29--. http://dl.acm.org/citation.cfm?id=956417.956570
[136]
S. S. Mukherjee, C. T. Weaver, J. Emer, S. K. Reinhardt, and T. Austin. 2003b. Measuring architectural vulnerability factors. IEEE Micro 23, 6 (Nov. 2003), 70--75.
[137]
D. E. Muller and W. S. Bartky. 1959. A theory of asynchronous circuits. In Proceedings of International Symposium on the Theory of Switching. Harvard University Press.
[138]
P. C. Murley and G. R. Srinivasan. 1996. Soft-error Monte Carlo modeling program, SEMM. IBM J. Res. Dev. 40, 1 (Jan. 1996), 109--118.
[139]
A. A. Nair, S. Eyerman, J. Chen, L. K. John, and L. Eeckhout. 2015. Mechanistic modeling of architectural vulnerability factor. ACM Trans. Comput. Syst. 32, 4, Article 11 (Jan. 2015), 32 pages.
[140]
A. A. Nair, S. Eyerman, L. Eeckhout, and L. Kurian John. 2012. A first-order mechanistic model for architectural vulnerability factor. In Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA’12). IEEE Computer Society. 273--284. http://dl.acm.org/citation.cfm?id=2337159.2337191
[141]
N. Nakka, Z. Kalbarczyk, R. K. Iyer, and J. Xu. 2004. An architectural framework for providing reliability and security support. In Proceedings of the 2004 International Conference on Dependable Systems and Networks. 585--594.
[142]
M. Nicolaidis, R. O. Duarte, S. Manich, and J. Figueras. 1997. Fault-secure parity prediction arithmetic operators. IEEE Des. Test Comput. 14, 2 (1997), 60--71.
[143]
Nitin, I. Pomeranz, and T. N. Vijaykumar. 2015. FaultHound: Value-locality-based soft-fault tolerance. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA’15). ACM, New York, NY. 668--681.
[144]
E. Normand, J. L. Wert, H. Quinn, T. D. Fairbanks, S. Michalak, G. Grider, P. Iwanchuk, J. Morrison, S. Wender, and S. Johnson. 2010. First record of single-event upset on ground, cray-1 computer at Los Alamos in 1976. IEEE Trans. Nucl. Sci. 57, 6 (2010), 3114--3120.
[145]
Nvidia. 2009. Fermi Compute Architecture Whitepaper.
[146]
N. Oh, P. P. Shirvani, and E. J. McCluskey. 2002a. Control-flow checking by software signatures. IEEE Trans. Reliabil. 51, 1 (March 2002), 111--122.
[147]
N. Oh, P. P. Shirvani, and E. J. McCluskey. 2002b. Error detection by duplicated instructions in super-scalar processors. IEEE Trans. Reliabil. 51, 1 (March 2002), 63--75.
[148]
P. Oldiges, R. Dennard, D. Heidel, T. Ning, K. Rodbell, H. Tang, M. Gordon, and L. Wissel. 2009. Technologies to further reduce soft error susceptibility in SOI. In Proceedings of the 2009 IEEE International Electron Devices Meeting (IEDM’09). 1--4.
[149]
K. Osada, Y. Saitoh, E. Ibe, and K. Ishibashi. 2003. 16.7-fA/cell tunnel-leakage-suppressed 16-Mb SRAM for handling cosmic-ray-induced multierrors. IEEE J. Solid-State Circuits 38, 11 (Nov. 2003), 1952--1957.
[150]
J. M. Palau, G. Hubert, K. Coulie, B. Sagnes, M. C. Calvet, and S. Fourtine. 2001. Device simulation study of the SEU sensitivity of SRAMs to internal ion tracks generated by nuclear reactions. IEEE Trans. Nucl. Sci. 48, 2 (April 2001), 225--231.
[151]
D. J. Palframan, N. S. Kim, and M. H. Lipasti. 2014. Precision-aware soft error protection for GPUs. In Proceedings of the 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA’14). 49--59.
[152]
J. H. Patel and L. Y. Fung. 1982. Concurrent error detection in ALU’s by recomputing with shifted operands. IEEE Trans. Comput. 31, 7 (July 1982), 589--595.
[153]
N. Patil, Jie Deng, S. Mitra, and H.-S. P. Wong. 2009. Circuit-level performance benchmarking and scalability analysis of carbon nanotube transistor circuits. IEEE Trans. Nanotechnol. 8, 1 (2009), 37--45.
[154]
D. A. Patterson, G. Gibson, and R. H. Katz. 1988. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data (SIGMOD’88). ACM, New York, NY. 109--116.
[155]
W. W. Peterson and E. J. Weldon. 1972. Error-Correcting Codes. MIT Press, Cambridge, MA.
[156]
J. S. Plank, M. Beck, G. Kingsley, and K. Li. 1995. Libckpt: Transparent checkpointing under Unix. In Proceedings of the USENIX 1995 Technical Conference (TCON’95). USENIX Association, Berkeley, CA. 18--18. http://dl.acm.org/citation.cfm?id=1267411.1267429
[157]
M. Poolakkaparambil, J. Mathew, A. M. Jabir, and S. P. Mohanty. 2012. An investigation of concurrent error detection over binary Galois fields in CNTFET and QCA technologies. In ISVLSI. 141--146.
[158]
E. Pop, S. Dutta, D. Estrada, and A. Liao. 2009. Avalanche, joule breakdown and hysteresis in carbon nanotube transistors. In Proceedings of the 2009 IEEE International Reliability Physics Symposium. 405--408.
[159]
D. K. Pradhan (Ed.). 1996. Fault-Tolerant Computer System Design. Prentice-Hall, Upper Saddle River, NJ.
[160]
P. Prata and J. G. Silva. 1999. Algorithm based fault tolerance versus result-checking for matrix computations. In Proceedings of the 29th Annual International Symposium on Fault-Tolerant Computing, 1999. Digest of Papers. 4--11.
[161]
M. Prvulovic, Z. Zhang, and J. Torrellas. 2002. ReVive: Cost-effective architectural support for rollback recovery in shared-memory multiprocessors. In Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA’02). IEEE Computer Society. 111--122. http://dl.acm.org/citation.cfm?id=545215.545228
[162]
S. Raasch, A. Biswas, J. Stephan, P. Racunas, and J. Emer. 2015. A fast and accurate analytical technique to compute the AVF of sequential bits in a processor. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48’15). ACM, New York, NY. 738--749.
[163]
J. M. Rabaey. 1996. Digital Integrated Circuits: A Design Perspective. Prentice-Hall, Upper Saddle River, NJ.
[164]
P. Racunas, K. Constantinides, S. Manne, and S. S. Mukherjee. 2007. Perturbation-based fault screening. In Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture. 169--180.
[165]
R. G. Ragel and S. Parameswaran. 2006. IMPRES: Integrated monitoring for processor reliability and security. In Proceedings of the 43rd Annual Design Automation Conference (DAC’06). ACM, New York, NY. 502--505.
[166]
B. Ramkumar and V. Strumpen. 1997. Portable checkpointing for heterogeneous architectures. In Proceedings of the 27th Annual International Symposium on Fault-Tolerant Computing, 1997 (FTCS-27’97). Digest of Papers. 58--67.
[167]
J. Ray, J. C. Hoe, and B. Falsafi. 2001. Dual use of superscalar datapath for transient-fault detection and recovery. In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO 34’01). IEEE Computer Society. 214--224. http://dl.acm.org/citation.cfm?id=563998.564027
[168]
A. L. N. Reddy and P. Banerjee. 1990. Algorithm-based fault detection for signal processing applications. IEEE Trans. Comput. 39, 10 (1990), 1304--1308.
[169]
V. K. Reddy, A. S. Al-Zawawi, and E. Rotenberg. 2006. Assertion-based microarchitecture design for improved fault tolerance. In Proceedings of the 2006 International Conference on Computer Design. 362--369.
[170]
S. Rehman, M. Shafique, and J. Henkel. 2012. Instruction scheduling for reliability-aware compilation. In Proceedings of the 49th Annual Design Automation Conference (DAC’12). ACM, New York, NY. 1292--1300.
[171]
S. Rehman, M. Shafique, F. Kriebel, and J. Henkel. 2011. Reliable software for unreliable hardware: Embedded code generation aiming at reliability. In Proceedings of the 7th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS’11). ACM, New York, NY. 237--246.
[172]
S. K. Reinhardt and S. S. Mukherjee. 2000. Transient fault detection via simultaneous multithreading. In Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA’00). ACM, New York, NY. 25--36.
[173]
S. K. Reinhardt, S. S. Mukherjee, J. S. Emer, and C. T. Weaver. 2008. Managing external memory updates for fault detection in redundant multithreading systems using speculative memory support. (Oct. 2008). Patent No. 7444497, Filed Dec. 30, 2003, Issued Oct. 28, 2008.
[174]
G. A. Reis, J. Chang, and D. I. August. 2007. Automatic instruction-level software-only recovery. IEEE Micro 27, 1 (Jan. 2007), 36--47.
[175]
G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August. 2005. SWIFT: Software implemented fault tolerance. In Proceedings of the International Symposium on Code Generation and Optimization (CGO’05). IEEE Computer Society. 243--254.
[176]
M. W. Roberson. 1998. Soft error rates in solder bumped packaging. In Proceedings of the 4th International Symposium on Advanced Packaging Materials, 1998. 111--116.
[177]
L. R. Rockett. 1988. An SEU-hardened CMOS data latch design. IEEE Trans. Nucl. Sci. 35, 6 (Dec. 1988), 1682--1687.
[178]
L. R. Rockett. 1992. Simulated SEU hardened scaled CMOS SRAM cell design using gated resistors. IEEE Trans. Nucl. Sci. 39, 5 (Oct. 1992), 1532--1541.
[179]
K. P. Rodbell, D. F. Heidel, J. A. Pellish, P. W. Marshall, H. H. K. Tang, C. E. Murray, K. A. LaBel, M. S. Gordon, K. G. Stawiasz, J. R. Schwank, M. D. Berg, H. S. Kim, M. R. Friendlich, A. M. Phan, and C. M. Seidleck. 2011. 32 and 45 nm radiation-hardened-by-design (RHBD) SOI latches. IEEE Trans. Nucl. Sci. 58, 6 (Dec. 2011), 2702--2710.
[180]
E. Rotenberg. 1999. AR-SMT: A microarchitectural approach to fault tolerance in microprocessors. In Proceedings of the 29th Annual International Symposium on Fault-Tolerant Computing, 1999. Digest of Papers. 84--91.
[181]
A. Roy-Chowdhury and P. Banerjee. 1996. Algorithm-based fault location and recovery for matrix computations on multiprocessor systems. IEEE Trans. Comput. 45, 11 (1996), 1239--1247.
[182]
S. K. Sahoo, J. Criswell, C. Geigle, and V. Adve. 2013. Using likely invariants for automated software fault localization. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’13). ACM, New York, NY. 139--152.
[183]
P. N. Sanda, J. W. Kellington, P. Kudva, R. Kalla, R. B. McBeth, J. Ackaret, R. Lockwood, J. Schumann, and C. R. Jones. 2008. Soft-error resilience of the IBM POWER6 processor. IBM J. Res. Dev. 52, 3 (May 2008), 275--284.
[184]
M. A. Schuette and J. P. Shen. 1987. Processor control flow monitoring using signatured instruction streams. IEEE Trans. Comput. C-36, 3 (March 1987), 264--276.
[185]
N. Seifert, V. Ambrose, B. Gill, Q. Shi, R. Allmon, C. Recchia, S. Mukherjee, N. Nassif, J. Krause, J. Pickholtz, and A. Balasubramanian. 2010. On the radiation-induced soft error performance of hardened sequential elements in advanced bulk CMOS technologies. In Proceedings of the 2010 IEEE International Reliability Physics Symposium (IRPS’10). 188--197.
[186]
N. Seifert, S. Jahinuzzaman, J. Velamala, R. Ascazubi, N. Patel, B. Gill, J. Basile, and J. Hicks. 2015. Soft error rate improvements in 14-nm technology featuring second-generation 3D tri-gate transistors. IEEE Trans. Nucl. Sci. 62, 6 (Dec. 2015), 2570--2577.
[187]
N. Seifert, P. Shipley, M. D. Pant, V. Ambrose, and B. Gill. 2005. Radiation-induced clock jitter and race. In Proceedings of the 43rd Annual IEEE International Reliability Physics Symposium, 2005. 215--222.
[188]
N. Seifert, P. Slankard, M. Kirsch, B. Narasimham, V. Zia, C. Brookreson, A. Vo, S. Mitra, B. Gill, and J. Maiz. 2006. Radiation-induced soft error rates of advanced CMOS bulk devices. In Proceedings of the 2006 IEEE International Reliability Physics Symposium. 217--225.
[189]
N. Seifert and N. Tam. 2004. Timing vulnerability factors of sequentials. IEEE Trans. Device Materials Reliabil. 4, 3 (2004), 516--522.
[190]
F. F. Sellers, M. Xiao, and L. W. Bearnson. 1968. Error Detecting Logic for Digital Computers. McGraw-Hill, New York, NY.
[191]
S. A. Seshia, W. Li, and S. Mitra. 2007. Verification-guided soft error resilience. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE’07). EDA Consortium, San Jose, CA. 1442--1447. http://dl.acm.org/citation.cfm?id=1266366.1266681
[192]
C. E. Shannon. 2001. A mathematical theory of communication. SIGMOBILE Mob. Comput. Commun. Rev. 5, 1 (Jan. 2001), 3--55.
[193]
B. Shim, S. R. Sridhara, and N. R. Shanbhag. 2004. Reliable low-power digital signal processing via reduced precision redundancy. IEEE Trans. Very Large Scale Integration (VLSI) Syst. 12, 5 (May 2004), 497--510.
[194]
P. Shivakumar, M. Kistler, S. W. Keckler, D. Burger, and L. Alvisi. 2002. Modeling the effect of technology trends on the soft error rate of combinational logic. In Proceedings of the International Conference on Dependable Systems and Networks, 2002 (DSN’02). 389--398.
[195]
R. L. Shuler, C. Kouba, and P. M. O’Neill. 2005. SEU performance of TAG based flip-flops. IEEE Trans. Nucl. Sci. 52 (Dec. 2005), 2550--2553.
[196]
R. L. Shuler Jr. 2002a. Method and apparatus for reducing the vulnerability of latches to single event upsets. (April 2002). Patent No. 6377097, Filed Mar. 13, 2000, Issued Apr. 23, 2002.
[197]
R. L. Shuler Jr. 2002b. Method and apparatus for reducing the vulnerability of latches to single event upsets. (Dec. 2002). Patent No. 6492857, Filed Apr. 20, 2001, Issued Dec. 10, 2002.
[198]
D. P. Siewiorek and R. S. Swarz. 1998. Reliable Computer Systems: Design and Evaluation (3rd ed.). A. K. Peters, Natick, MA.
[199]
T. J. Slegel, R. M. Averill III, M. A. Check, B. C. Giamei, B. W. Krumm, C. A. Krygowski, W. H. Li, J. S. Liptay, J. D. MacDougall, T. J. McPherson, J. A. Navarro, E. M. Schwarz, K. Shum, and C. F. Webb. 1999. IBM’s S/390 G5 microprocessor design. IEEE Micro 19, 2 (1999), 12--23.
[200]
J. Snyder and J. Larson. 2007. CMOS device with zero soft error rate. (April 2007). Patent No. 20070080406, Filed Oct. 12, 2006, Issued Apr. 12, 2007.
[201]
J. P. Snyder and J. M. Larson. 2011. Method of manufacturing a CMOS device with zero soft error rate. (Feb. 2011). Patent No. 20110034016, Filed Oct. 20, 2010, Issued Feb. 10, 2011.
[202]
G. S. Sohi, M. Franklin, and K. K. Saluja. 1989. A study of time-redundant fault tolerance techniques for high-performance pipelined computers. In Proceedings of the 19th International Symposium on Fault-Tolerant Computing, 1989 (FTCS-19’89). Digest of Papers. 436--443.
[203]
D. J. Sorin. 2009. Fault tolerant computer architecture. Synthesis Lectures on Computer Architecture 4, 1 (2009), 1--104.
[204]
D. J. Sorin, M. M. K. Martin, M. D. Hill, and D. A. Wood. 2002. SafetyNet: Improving the availability of shared memory multiprocessors with global checkpoint/recovery. In Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA’02). IEEE Computer Society, Washington, DC. 123--134. http://dl.acm.org/citation.cfm?id=545215.545229
[205]
R. F. Sproull, I. E. Sutherland, and C. E. Molnar. 1994. The counterflow pipeline processor architecture. IEEE Des. Test 11, 3 (July 1994), 48--59.
[206]
V. Sridharan and D. R. Kaeli. 2009. Eliminating microarchitectural dependency from architectural vulnerability. In Proceedings of the 2009 IEEE 15th International Symposium on High Performance Computer Architecture. 117--128.
[207]
K. Sundaramoorthy, Z. Purser, and E. Rotenberg. 2000. Slipstream processors: Improving both performance and fault tolerance. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IX’00). ACM, New York, NY. 257--268.
[208]
A. Taber and E. Normand. 1993. Single event upset in avionics. IEEE Trans. Nucl. Sci. 40 (April 1993), 120--126.
[209]
D. Tiwari, S. Gupta, J. Rogers, D. Maxwell, P. Rech, S. Vazhkudai, D. Oliveira, D. Londo, N. DeBardeleben, P. Navaux, L. Carro, and A. Bland. 2015. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. In Proceedings of the 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA’15). 331--342.
[210]
S. J. Tans, A. R. M. Verschueren, and C. Dekker. 1998. Room-temperature transistor based on a single carbon nanotube. Nature 393, 6680 (1998), 49--52.
[211]
R. R. Tummala, E. J. Rymaszewski, and A. G. Klopfenstein. 1997. Microelectronics Packaging Handbook: Semiconductor Packaging. Springer.
[212]
T. Uemura, T. Kato, H. Matsuyama, and M. Hashimoto. 2015. Soft error immune latch design for 20 nm bulk CMOS. In Proceedings of the 2015 IEEE International Reliability Physics Symposium. SE.4.1--SE.4.6.
[213]
T. N. Vijaykumar, I. Pomeranz, and K. Cheng. 2002. Transient-fault recovery using simultaneous multithreading. In Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA’02). IEEE Computer Society. 87--98.
[214]
J. T. Wallmark and S. M. Marcus. 1962. Minimum size and maximum packing density of nonredundant semiconductor devices. Proc. IRE 50, 3 (March 1962), 286--298.
[215]
F. Wang, Y. Xie, K. Bernstein, and Y. Luo. 2006. Dependability analysis of nano-scale FinFET circuits. In Proceedings of the IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures (ISVLSI’06).
[216]
N. J. Wang. 2007. Cost Effective Soft Error Mitigation In Microprocessors. Ph.D. Dissertation. University of Illinois at Urbana-Champaign, Champaign, IL. Advisor(s) Sanjay J. Patel. AAI3290421.
[217]
N. J. Wang and S. J. Patel. 2006. ReStore: Symptom-based soft error detection in microprocessors. IEEE Trans. Dependable Secur. Comput. 3, 3 (July 2006), 188--201.
[218]
Y. M. Wang, Y. Huang, and W. K. Fuchs. 1993. Progressive retry for software error recovery in distributed systems. In Proceedings of the 23rd International Symposium on Fault-Tolerant Computing, 1993 (FTCS-23’93). Digest of Papers. 138--144.
[219]
C. Weaver, J. Emer, S. S. Mukherjee, and S. K. Reinhardt. 2004. Techniques to reduce the soft error rate of a high-performance microprocessor. In Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA’04). IEEE Computer Society. 264--. http://dl.acm.org/citation.cfm?id=998680.1006723
[220]
H. T. Weaver. 1987. An SEU tolerant memory cell derived from fundamental studies of SEU mechanisms in SRAM. IEEE Trans. Nucl. Sci. 34, 6 (Dec. 1987), 1281--1286.
[221]
S. Whitaker, J. Canaris, and K. Liu. 1991. SEU hardened memory cells for a CCSDS Reed-Solomon encoder. IEEE Trans. Nucl. Sci. 38, 6 (Dec. 1991), 1471--1477.
[222]
S. R. Whitaker. 1992. Single event upset hardening CMOS memory circuit. (May 1992). Patent No. 5111429, Filed Nov. 6, 1990, Issued May 5, 1992.
[223]
M. Wilkening, V. Sridharan, S. Li, F. Previlon, S. Gurumurthi, and D. R. Kaeli. 2014. Calculating architectural vulnerability factors for spatial multi-bit transient faults. In Proceedings of the 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. 293--305.
[224]
M. J. Y. Williams and J. B. Angell. 1973. Enhancing testability of large-scale integrated circuits via test points and additional logic. IEEE Trans. Comput. 22, 1 (Jan. 1973), 46--60.
[225]
H. S. P. Wong, J. Appenzeller, V. Derycke, R. Martel, S. Wind, and P. Avouris. 2003. Carbon nanotube field effect transistors - fabrication, device physics, and circuit implications. In Proceedings of the 2003 IEEE International Solid-State Circuits Conference, Digest of Technical Papers (ISSCC’03). Vol. 1. 370--500.
[226]
A. Wood. 1999. Data integrity concepts, features, and technology. (April 1999). White paper, Tandem Division, Compaq Computer Corporation.
[227]
J. Xu. 2003. Software and Hardware Techniques for Masking Security Vulnerabilities. Ph.D. Dissertation. Champaign, IL, USA. Advisor(s) Iyer, Ravishankar K. AAI3111659.
[228]
J. Yang and H.-W. Huang. 2000. Triple well structure. (Aug. 2000). Patent No. 6111283, Filed Feb. 1, 1999, Issued Aug. 29, 2000.
[229]
D. H. Yoon and M. Erez. 2010. Virtualized and flexible ECC for main memory. In Proceedings of the 15th Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems (ASPLOS XV’10). ACM, New York, NY. 397--408.
[230]
M. Zhang. 2006. Sequential element design with built-in soft error resilience. IEEE Trans. Very Large Scale Integration (VLSI) Syst. 14, 12 (Dec. 2006), 1368--1378.
[231]
J. F. Ziegler. 1996. Terrestrial cosmic rays. IBM J. Res. Dev. 40, 1 (Jan. 1996), 19--39.
[232]
J. F. Ziegler and W. A. Lanford. 1979. Effect of cosmic rays on computer memories. Science 206, 4420 (1979), 776--788.

Information

Published In

ACM Computing Surveys  Volume 49, Issue 3
September 2017
658 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/2988524
Editor: Sartaj Sahni
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 November 2016
Accepted: 01 September 2016
Revised: 01 August 2016
Received: 01 December 2015
Published in CSUR Volume 49, Issue 3


Author Tags

  1. Processor
  2. recovery
  3. soft error

Qualifiers

  • Survey
  • Research
  • Refereed
