This book provides a comprehensive description of the architectural techniques to tackle the soft error problem. It covers new methodologies for the quantitative analysis of soft errors as well as novel, cost-effective architectural techniques to mitigate them. To give readers a better grasp of the broader problem definition and solution space, this book also delves into the physics of soft errors and reviews current circuit and software mitigation techniques.

TABLE OF CONTENTS
Chapter 1: Introduction
Chapter 2: Device- and Circuit-Level Modeling, Measurement, and Mitigation
Chapter 3: Architectural Vulnerability Analysis
Chapter 4: Advanced Architectural Vulnerability Analysis
Chapter 5: Error Coding Techniques
Chapter 6: Fault Detection via Redundant Execution
Chapter 7: Hardware Error Recovery
Chapter 8: Software Detection and Recovery

* Provides the methodologies necessary to quantify the effect of radiation-induced soft errors as well as state-of-the-art techniques to protect against them
- M. Agostinelli, J. Hicks, J. Xu, B. Woolery, K. Mistry, K. Zhang, S. Jacobs, J. Jopling, W. Yang, B. Lee, T. Raz, M. Mehalel, P. Kolar, Y. Wang, J. Sandford, D. Pivin, C. Peterson, M. DiBattista, S. Pae, M. Jones, S. Johnson, and G. Subramanian, "Erratic Fluctuations of SRAM Cache Vmin at the 90 nm Process Technology Node," in IEEE International Electron Devices Meeting (IEDM) , pp. 655-658, December 2005.Google Scholar
- H. Ando, Y. Yoshida, A. Inoue, I. Sugiyama, T. Asakawa, K. Morita, T. Muta, T. Motokurumada, S. Okada, H. Yamashita, Y. Satsukawa, A. Konmoto, R. Yamashita, and H. Sugiyama, "A 13 GHz Fifth Generation SPARC64 Microprocessor," in IEEE Journal of Solid State Circuits , Volume 38, Issue 11, pp. 1896-1905, November 2003.Google ScholarCross Ref
- R. Baumann, "Tutorial on Soft Errors," in International Reliability Physics Symposium (IRPS) Tutorial Notes , IEEE, Dallas, Texas, USA, April 2002.Google Scholar
- R. Baumann, T. Hossain, E. Smith, S. Murata, and H. Kitagawa, "Boron as a Primary Source of Radiation in High Density DRAMs," in IEEE Symposium on VLSI , pp. 81-82, June 1995.Google Scholar
- S. Borkar, "Designing Reliable Systems fromUnreliable Components: The Challenges of Transistor Variability and Degradation," IEEE Micro , Volume 25, Issue 6, pp. 10-16, November/December 2005. Google ScholarDigital Library
- D. Bossen, "CMOS Soft Errors and Server Design," in International Reliability Physics Symposium (IRPS) Tutorial Notes , IEEE, Dallas, Texas, USA, April 2002.Google Scholar
- M. W. Friedlander, A Thin Cosmic Rain: Particles from Outer Space , Harvard University Press, November 2002.Google Scholar
- S. Hareland, J. Maiz, M. Alavi, K. Mistry, S. Walstra, and C. Dai, "Impact of CMOS Process Scaling and SOI on the Soft Error Rates of Logic Processes," in Symposium on VLSI Technology Digest of Technical Papers , pp. 73-74, June 2001.Google Scholar
- M. S. Gordon, et al., "Measurement of the Flux and Energy Spectrum of Cosmic-Ray Induced Neutrons on the Ground," IEEE Transactions on Nuclear Science , Vol. 51, No. 6, Part 2, pp. 3427-3434, December 2004.Google ScholarCross Ref
- B. R. Havekort, et al., Performability Modelling: Techniques and Tools , John Wiley and Sons, 2001.Google Scholar
- P. Hazucha and C. Svensson, "Impact of CMOS Technological Scaling on the Atmospheric Neutron Soft Error Rate," IEEE Transactions on Nuclear Science , Vol. 47, No. 6, pp. 2586-2594, December 2000.Google ScholarCross Ref
- T. Karnik, P. Hazucha, and J. Patel, "Characterization of Soft Errors Caused by Single Event Upsets in CMOS Processes," IEEE Transactions on Dependable and Secure Computing , Vol. 1, No. 2, pp. 128-143, April-June 2004. Google ScholarDigital Library
- JEDEC Standard, "Measurement and Reporting of Alpha Particles and Terrestrial Cosmic Ray-Induced Soft Errors in Semiconductor Devices," JESD89 , August 2001.Google Scholar
- J. Maiz, S. Hareland, K. Zhang, and P. Armstrong, "Characterization of Multi-Bit Soft Error Events in Advanced SRAMs," Digest of International Electronic Device Meeting (IEDM) , pp. 21.4.1-21.4.4, December 2003.Google Scholar
- T. C. May and M. H. Woods, "Alpha-Particle-Induced Soft Errors in Dynamic Memories," IEEE Transactions on Electronic Devices , Vol. 26, Issue 1, pp. 2-9, January 1979.Google ScholarCross Ref
- S. E. Michalak, K. W. Harris, N. W. Hengartner, B. E. Takala, and S. A. Wender, "Predicting the Number of Fatal Soft Errors in Los Alamos National Laboratory's ASC Q Supercomputer," IEEE Transactions on Device and Materials Reliability , Vol. 5, No. 3, pp. 329-335, September 2005.Google ScholarCross Ref
- E. Normand, "Single Event Upset at Ground Level," IEEE Transactions on Nuclear Science , Vol. 43, No. 6, pp. 2742-2750, December 1996.Google Scholar
- D. K. Pradhan, Fault-Tolerant Computer System Design , Prentice-Hall, 2003. Google Scholar
- G. Reis, J. Chang, N. Vachharajani, R. Rangan, D. August, and S. S. Mukherjee, "Design and Evaluation of Hybrid Fault-Detection Systems," in International Symposium on Computer Architecture (ISCA) , pp. 148-159, Madison, Wisconsin, USA, June 2005. Google Scholar
- N. Seifert, et al., "Radiation-Induced Soft Error Rates of Advanced CMOS Bulk Devices," in 44th Annual International Reliability Physics Symposium (IRPS) , pp. 217-225, 2006.Google Scholar
- G. R. Srinivasan, "Modeling the Cosmic-Ray-Induced Soft-Error Rate in Integrated Circuits: An Overview," IBM Journal of Research and Development , Vol. 40, No. 1, pp. 77-89, January 1996. Google ScholarDigital Library
- J. H. Strathis, "Reliability Limits for the Gate Insulator in CMOS Technology," IBM Journal of Research and Development , Vol. 46, No. 2/3, pp. 265-286, March/May 2002. Google ScholarDigital Library
- J. Segura and C. F. Hawkins, CMOS Electronics: How ItWorks, How It Fails , Wiley-IEEE Press, 2004. Google Scholar
- H. H. K. Tang, "Nuclear Physics of Cosmic Ray Interaction with Semiconductor Materials: Particle-Induced Soft Errors from a Physicist's Perspective," IBM Journal of Research and Development , Vol. 40, No. 1, pp. 91-108, January 1996. Google ScholarDigital Library
- Y. Tosaka, S. Satoh, K. Suzuki, T. Suguii, H. Ehara, G. A. Woffinden, and S. A. Wender, "Impact of Cosmic Ray Neutron Induced Soft Errors, on Advanced Submicron CMOS Circuits," in VLSI Symposium on VLSI Technology Digest of Technical Papers , pp. 148-149, June 1996.Google Scholar
- C. Weaver, J. Emer, S. S. Mukherjee, and S. K. Reinhardt, "Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor," in 31st Annual International Symposium on Computer Architecture , pp. 264-275, June 2004. Google Scholar
- A. P. Wood, "Software Reliability from the Customer View," IEEE Computer , Vol. 36, No. 8, pp. 37-42, August 2003. Google ScholarDigital Library
- J. F. Ziegler, "Terrestrial Cosmic Rays," IBM Journal of Research and Development , Vol. 40, No. 1, pp. 19-39, January 1996. Google ScholarDigital Library
- J. F. Ziegler and W. A. Lanford, "The Effect of Cosmic Rays on Computer Memories," Science , Vol. 206, No. 776, 1979.Google Scholar
- J. F. Zielger and H. Puchner, SER-- History, Trends and Challenges , Cypress Semiconductor Corporation, 2004.Google Scholar
- M. P. Baze and S. P. Buchner, "Attenuation of Single Event Induced Pulses in CMOS Combinational Logic," IEEE Transactions on Nuclear Science , Vol. 44, No. 6, pp. 2217-2223, December 1997.Google ScholarCross Ref
- M. J. Bellido-Diaz, J. Juan-Chico, A. J. Acosta, M. Valencia, and J. L. Heurtas, "Logical Modeling of Delay Degradation Effect in Static CMOS Gates," IEE Proceedings Circuits, Devices, and Systems , Vol. 147, No. 2, pp. 107-117, April 2000.Google ScholarCross Ref
- T. Calin, M. Nicolaidis, and R. Velazco, "Upset Hardened Memory Design for Submicron CMOS Technology," IEEE Transactions on Nuclear Science , Vol. 43, No. 6, pp. 2874-2878, December 1996.Google Scholar
- E. H. Cannon, D. D. Reinhardt, M. S. Gordon, and P. S. Makowenskyj, "SRAM SER in 90, 130 and 180 nm Bulk and SOI Technologies," in Reliability Physics Symposium Proceedings, 2004. 42nd Annual. 2004 IEEE International , pp. 300-304, 25-29 April 2004.Google Scholar
- C. Constantinescu, "Neutron SER Characterization of Microprocessors," in International Conference on Dependable Systems and Networks (DSN) , pp. 754-759, July 2005. Google Scholar
- L. B. Freeman, "Critical Charge Calculations for a Bipolar SRMA Array," IBM Journal of Research and Development , Vol. 40, No. 1, pp. 119-129, January 1996. Google ScholarDigital Library
- B. S. Gill, C. Papachristou, F. G. Wolff, and N. Seifert, "Node Sensitivity Analysis for Soft Errors in CMOS Logic," in International Test Conference , paper 37.2, pp. 1-9, November 2005.Google Scholar
- P. Hazucha, T. Karnik, J. Maiz, S. Walstra, B. Bloechel, J. Tschanz, G. Dermer, S. Hareland, P. Armstrong, and S. Borkar, "Neutron Soft Error Rate Measurements in a 90-nm CMOS Process and Scaling Trends in SRAM from 0.25-µm to 90-nm Generation," in IEDM '03 Technical Digest, IEEE International , pp. 21.5.1-21.5.4, 8-10 December, 2003.Google Scholar
- P. Hazucha, T. Karnik, S. Walstra, B. A. Bloechel, J. W. Tschanz, J. Maiz, K. Soumyanath, G. E. Dermer, S. Narenda, V. De, and S. Borkar, "Measurements and Analysis of SER-Tolerant Latch in a 90-nm Dual-V T CMOS Process," IEEE Journal of Solid-State Circuits , Vol. 39, No. 9, pp. 617-620, September 2004.Google ScholarCross Ref
- P. Hazucha and C. Svensson, "Impact of CMOS Technology Scaling on the Atmospheric Neutron Soft Error Rate," IEEE Transactions on Nuclear Science , Vol. 47, No. 6, pp. 2586-2594, December 2000.Google ScholarCross Ref
- P. Hazucha, C. Svensson, and S. A. Wender, "Cosmic-Ray Soft Error Rate Characterization of a Standard 0.6-µm CMOS Process," IEEE Journal of Solid-State Circuits , Vol. 35, No. 10, pp. 1422-1429, October 2000.Google ScholarCross Ref
- M. A. Horowitz, Timing Models for MOS Circuits , Technical Report SEL83-003, Integrated Circuits Laboratory, Stanford University, 1983. Google ScholarDigital Library
- JEDEC standard JESD89, Measurement and Reporting of Alpha Particles and Terrestrial Cosmic-Ray-Induced Soft Errors in Semiconductor Devices , August 2001.Google Scholar
- T. Karnik, P. Hazucha, and J. Patel, "Characterization of Soft Errors Caused by Single Event Upsets in CMOS Processes," IEEE Transactions on Dependable and Secure Computing , Vol. 1, No. 2, pp. 128-143, April-June 2004. Google ScholarDigital Library
- T. Karnik, S. Vangal, V. Veeramachaneni, P. Hazucha, V. Errguntla, and S. Borkar, "Selective Node Engineering for Chip-Level Soft Error Rate Improvement," in 2002 Symposium on VLSI Circuits Digest of Technical Papers , pp. 204-205, June 2002.Google Scholar
- P. Liden, P. Dahlgren, R. Johansson, and J. Karlsson, "On Latching Probability of Particle Induced Transient in Combinatorial Networks," in 24th Symposium on Fault-Tolerant Computing (FTCS) , pp. 340-349, June 1994.Google Scholar
- S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, "Robust System Design with Built-In Soft-Error Resilience," Vol. 38, No. 2, pp. 43-52, IEEE Computer , February 2005. Google Scholar
- K. Mohanram and N. A. Touba, "Cost-Effective Approach for Reducing Soft Error Failure Rate in Logic Circuits," in International Test Conference , Sep. 30-Oct. 2, 2003.Google Scholar
- P. C. Murley and G. R. Srinivasan, "Soft-Error Monte Carlo Modeling Program, SEMM," IBM Journal of Research and Development , Vol. 40, No. 1, pp. 109-118, January 1996. Google ScholarDigital Library
- E. Normand, "Single Event Upset at Ground Level," IEEE Transactions on Nuclear Science , Vol. 43, No. 6, pp. 2742-2750, December 1996.Google Scholar
- J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits , Prentice Hall, 2003. Google ScholarDigital Library
- L. Rockett, "An SEU Hardened CMOS Data Latch Design," IEEE Transactions on Nuclear Science , Vol. NS-35, No. 6, pp. 1682-1687, December 1988.Google ScholarCross Ref
- N. Seifert, P. Shipley, M. D. Pant, V. Ambrose, and B. Gill, "Radiation-Induced Clock Jitter and Race," in International Reliability Physics Symposium , pp. 215-222, April 2005.Google Scholar
- N. Seifert and N. Tam, "Timing Vulnerability Factors of Sequentials," IEEE Transactions on Device and Materials Reliability , Vol. 3, No. 4, pp. 516-522, September 2004.Google ScholarCross Ref
- P. Shivakumar, M. Kistler, S. W. Keckler, D. Burger, and L. Alvisi, "Modeling the Effect of Technology Trends on the Soft Error Rate of Combinatorial Logic," in International Conference on Dependable Systems and Networks , pp. 389-398, June 2002. Google Scholar
- A. Taber and E. Normand, "Single Event Upset in Avionics," IEEE Transactions on Nuclear Science , Vol. 40, No. 2, pp. 120-126, April 1993.Google ScholarCross Ref
- S. Yamamoto, K. Kokuryou, Y. Okada, J. Komori, E. Murakami, K. Kubota, N. Matsuoka, and Y. Nagai, "Neutron-Induced Soft Error in Logic Devices Using Quasi-Monenergetic Neutron Beam," in 42nd Annual International Reliability Physics Symposium , Phoenix, pp. 305-309, April 2004.Google Scholar
- M. Zhang and N. R. Shanbhag, "ASoft Error Rate Analysis (SERA) Methodology," in International Conference on Computer Aided Design , pp. 111-118, November 2004. Google Scholar
- J. F. Ziegler andW. A. Lanford, "Effect of Cosmic Rays on Computer Memories," Science , Vol. 206, No. 4420, pp. 776-788, November 1979.Google ScholarCross Ref
- J. F. Zielger and H. Puchner, SER--History, Trends and Challenges , Cypress Semiconductor Corporation, 2004.Google Scholar
- A. Biswas, P. Racunas, J. Emer, and S. S. Mukherjee, "Computing Accurate AVFs using ACE Analysis on Performance Models: A Rebuttal," Computer Architecture Letters (CAL) , December 2007. Google Scholar
- J. A. Butts and G. Sohi, "Dynamic Dead-Instruction Detection and Elimination," in 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) , pp. 199-210, October 2002. Google Scholar
- B. Fahs, S. Bose, M. Crum, B. Slechta, F. Spadini, T. Tung, S. J. Patel, and S. S. Lumetta, "Performance Characterization of a Hardware Mechanism for Dynamic Optimization," in 34th Annual International Symposium on Microarchitecture (MICRO) , pp. 16-27, December 2001. Google Scholar
- Y. Choi, A. Knies, L. Gerke, and T.-F. Ngai, "The Impact of If-Conversion and Branch Prediction on Program Execution on the Intel Itanium Processor," in 34th Annual International Symposium on Microarchitecture (MICRO) , pp. 182-191, December 2001. Google Scholar
- J. Emer, P. Ahuja, N. Binkert, E. Borch, R. Espasa, T. Juan, A. Klauser, C. K. Luk, S. Manne, S. S. Mukherjee, H. Patil, and S. Wallace, "Asim: A Performance Model Framework," IEEE Computer , Vol. 35, No. 2, pp. 68-76, February 2002. Google ScholarDigital Library
- J. L. Hennessy and D.A. Patterson, Computer Architecture:AQuantitative Approach , Elsevier Science, 2003. Google Scholar
- E. D. Lazowska, J. Zahorjan, G. S. Graham, and K. C. Sevcik, Quantitative System Performance , Prentice-Hall, Englewood Cliffs, New Jersey, 1984. Google Scholar
- X. Li, S. V. Adve, P. Bose, and J. A. Rivers, "Architecture-Level Soft Error Analysis: Examining the Limits of Common Assumptions," in International Conference on Dependable Systems and Networks (DSN) , pp. 266-275, 2007. Google Scholar
- X. Li, S. V. Adve, P. Bose, and J. A. Rivers, "SoftArch: An Architecture-Level Tool for Modeling and Analyzing Soft Errors," in International Conference on Dependable Systems and Networks (DSN) , pp. 496-505, 2005. Google Scholar
- S. S. Mukherjee, C. T. Weaver, J. Emer, S. K. Reinhardt, and T. Austin, "A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor," in 36th Annual International Symposium on Microarchitecture (MICRO) , pp. 29-40, December 2003. Google Scholar
- H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karnunanidhi, "Pinpointing Representative Portions of Large Intel Itanium Programs with Dynamic Instrumentation," in 37th Annual International Symposium on Microarchitecture (MICRO) , pp. 81-92, 2004. Google Scholar
- T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, "Automatically Characterizing Large Scale Program Behavior," in 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) , pp. 45-57, October 2002. Google Scholar
- N. Wang, M. Fertig, and S. Patel, "Y-Branches: When You Come to a Fork in the Road, Take It," in 12th International Conference on Parallel Architectures and Compilation Techniques (PACT) , pp. 56-67, 2003. Google Scholar
- N. Wang, A. Mahesri, and S. J. Patel, "Examining ACE Analysis Reliability Estimates Using Fault-Injection," in 34th International Symposium on Computer Architecture (ISCA) , pp. 460-469, 2007. Google Scholar
- C. Weaver, J. Emer, S. S. Mukherjee, and S. K. Reinhardt, "Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor," in 31st Annual International Symposium on Computer Architecture , pp. 264-275, June 2004. Google Scholar
- A. O. Allen, Probability, Statistics, and Queue Theory with Computer Science Applications , Academic Press, 1990. Google Scholar
- AMD, "BIOS and Kernel Developer's Guide for AMD Athlon¿64 and AMD Opteron¿ Processors." Publication #26094, Revision 3.14, April 2004. Available at: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26094.PDF.Google Scholar
- A. Biswas, P. Racunas, R. Cheveresan, J. Emer, S. S. Mukherjee, and R. Rangan, "Computing Architectural Vulnerability Factors for Address-Based Structures," in 32nd Annual International Symposium on Computer Architecture (ISCA) , pp. 532-543, June 2005. Google Scholar
- J. L. Hennessy and D. L. Patterson, Computer Architecture: A Quantitative Approach , Morgan Kaufmann Publishers, 2003. Google ScholarDigital Library
- S. Kim and A. K. Somani, "Soft Error Sensitivity Characterization for Microprocessor Dependability Enhancement Strategy," in International Conference on Dependable Systems and Networks (DSN) , pp. 416-425, June 2002. Google Scholar
- A. Lai, C. Fide, and B. Falsafi. "Dead-Block Prediction and Dead-Block Correlating Prefetchers," in 28th International Symposium on Computer Architecture , pp. 144-154, June 2001. Google Scholar
- H. T. Nguyen, Y. Yagil, N. Seifert, and M. Reitsma, "Chip-Level Soft Error Estimation Method," IEEE Transactions on Device and Materials Reliability , Vol. 5, No. 3, pp. 365-381, September 2005.Google ScholarCross Ref
- N. Wang and S. J. Patel, "ReStore: Symptom-Based Soft Error Detection in Microprocessors," IEEE Transactions on Dependable and Secure Computing , Vol. 3, No. 3, pp. 188-201, July-September 2006. Google ScholarDigital Library
- N. Wang, J. Quek, T. M. Rafacz, and S. J. Patel, "Characterizing the Effects of Transient Faults on a High-Performance Processor Pipeline," in International Conference on Dependable Systems and Networks (DSN) , pp. 61-70, June 2004. Google Scholar
- D. Wood, M. Hill, and R. Kessler. "A Model for Estimating Trace-Sample Miss Ratios," in 1991 SIGMETRICS Conference on Measurement and Modeling of Computer Systems , pp. 79-89, May 1991. Google Scholar
- AMD, "BIOS and Kernel Developer's Guide for AMD Athlon¿ 64 and AMD Opteron¿ Processors," Publication #26094, Revision 3.14, April 2004. Available at: http://www.amd.com/ us-en/assets/content_type/white_papers_and_tech_docs/26094.PDF.Google Scholar
- H. Ando, Y. Yoshida, A. Inoue, I. Sugiyama, T. Asakawa, K. Morita, T. Muta, T. Motokurumada, S. Okada, H. Yamashita, Y. Satsukawa, A. Konmoto, R. Yamashita, and H. Sugiyama, "A1.3 GHz Fifth Generation SPARC64 Microprocessor," in International Solid-State Circuits Conference , pp. 1896-1905, 2003. Google Scholar
- D. C. Bossen, A. Kitamorn, K. F. Reick, and M. S. Floyd, "Fault-Tolerant Design of the IBM pSeries 690 System Using POWER4 Processor Technology," IBM Journal of Research and Development , Vol. 46, No. 1, pp. 77-86, 2002. Google ScholarDigital Library
- O. Ergin, O. Unsal, X. Vera, and A. Gonzalez, "Exploiting Narrow Values for Soft Error Tolerance," IEEE Computer Architecture Letters , Vol. 5, pp. 12-12, 2006. Google ScholarDigital Library
- M.Y. Hsiao, "A Class of Optimal Minimum Odd-Weight-Column SEC-DED Codes," IBM Journal of Research and Development , Vol. 14, No. 4, pp. 395-401, 1970. Google ScholarDigital Library
- S. Iacobovici, "Residue-Based Error Detection for a Shift Operation," United States Patent Application, filed August 22, 2005.Google Scholar
- Intel Corporation, Intel® 64 and IA-32 Architectures, Software Developer's Manual, Volume 3A: System Programming Guide, Part 1 . Available at: http://www.intel.com.Google Scholar
- J.-C. Lo, "Reliable Floating-Point Arithmetic Algorithms for Error-Coded Operands," IEEE Transactions on Computers , Vol. 43, No. 4, pp. 400-412, April 1994. Google ScholarDigital Library
- S. S. Mukherjee, J. Emer, T. Fossum, and S. K. Reinhardt, "Cache Scrubbing in Microprocessors: Myth or Necessity?" in 10th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC) , pp. 37-42, March 3-5, 2004, Papeete, French Polynesia. Google Scholar
- N. Nakka, J. Xu, Z. Kalbarczyk, and R. K. Iyer, "An Architectural Framework for Providing Reliability and Security Support," Dependable Systems and Networks (DSN) , pp. 585-594, June 2004. Google Scholar
- I. A. Noufal and M. Nicolaidis, "A CAD Framework for Generating Self-Checking Multipliers Based on Residue Codes," in Design, Automation and Test in Europe Conference and Exhibition , pp. 122-129, 1999. Google Scholar
- M. Nicolaidis, "Carry Checking/Parity Prediction Adders and ALUs," IEEE Transactions on Very Large Scale Integration (VLSI) , Vol. 11, No. 1, pp. 121-128, February 2003. Google ScholarDigital Library
- M. Nicolaidis and R. O. Duarte, "Fault-Secure Parity Prediction Booth Multipliers," IEEE Design and Test of Computers , Vol. 16, No. 3, pp. 90-101, July-September 1999. Google ScholarDigital Library
- M. Nicolaidis, R. O. Duarte, S. Manich, and J. Figueras, "Fault-Secure Parity Prediction Arithmetic Operators," IEEE Design and Test of Computers , Vol. 14, No. 2, pp. 60-71, April-June 1997. Google ScholarDigital Library
- W. W. Peterson and E. J. Weldon, Jr., Error-Correcting Codes , MIT Press, 1961.Google Scholar
- D. K. Pradhan, Fault-Tolerant Computer System Design , Prentice-Hall, 2003. Google Scholar
- V. K. Reddy, A. S. Al-Zawawi, and E. Rotenberg. "Assertion-Based Microarchitecture Design for Improved Fault Tolerance." in Proceedings of the 24th IEEE International Conference on Computer Design (ICCD-24) , pp. 362-369, October 2006.Google Scholar
- A. M. Saleh, J. J. Serrano, and J. H. Patel, "Reliability of Scrubbing Recovery Techniques for Memory Systems," IEEE Transactions on Reliability , Vol. 39, No. 1, pp. 114-122, April 1990.Google ScholarCross Ref
- C. E. Shannon, "A Mathematical Theory of Communication," Bell System Technical Journal , Vol. 27, pp. 379-423, 623-656, July-October, 1948.Google ScholarDigital Library
- N. Wang, M. Fertig, and S. Patel, "Y-Branches: When You Come to a Fork in the Road, Take It," in 12th International Conference on Parallel Architectures and Compilation Techniques (PACT) , pp. 56-66, 2003. Google Scholar
- C. Webb, "z6--The Next-Generation Mainframe Microprocessor," Hot Chips , August 19-21, 2007.Google Scholar
- C. Weaver, J. Emer, S. S. Mukherjee, and S. K. Reinhardt, "Reducing the Soft Error Rate of a Microprocessor," IEEE Micro , Vol. 24, No. 6, pp. 30-37, November-December 2004. Google ScholarDigital Library
- T. M. Austin, "DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design," in 32nd Annual International Symposium on Microarchitecture (MICRO) , pp. 196-207, 1999. Google Scholar
- D. Bernick, B. Bruckert, P. D. Vigna, D. Garcia, R. Jardine, J. Klecka, and J. Smullen, "NonStop® AdvancedArchitecture," in Proceedings. International Conference on Dependable Systems and Networks (DSN) , pp. 12-21, Yakohama, Japan, June/July 2005. Google Scholar
- T. D. Bissett, P. A. Leveille, E. Muench, G. A. Tremblay, "Loosely-Coupled, Synchronized Execution," United States Patent 5,896,523, issued April 20, 1999.Google Scholar
- M. A. Gomaa, C. Scarbrough, T. N. Vijaykumar, and I. Pomeranz, "Transient Fault-Recovery for Chip Multiprocessors," in Proceedings of 30th Annual International Symposium on Computer Architecture (ISCA) , pp. 98-109, June 2003. Google Scholar
- M. A. Gomaa and T. N. Vijaykumar, "Opportunistic Fault Detection," in 32nd Annual International Symposium on Computer Architecture (ISCA) , pp. 172-183, Madison, Wisconsin, USA, June 2005. Google Scholar
- S. S. Mukherjee, M. Kontz, and S. K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives," in Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA) , pp. 99-110, Anchorage, Alaska, USA, May 2002. Google Scholar
- R. Nair and J. E. Smith, "Method and Apparatus for Fault-Tolerance Via Dual Thread Crosschecking," United States Patent Application, publication date September 19, 2002.Google Scholar
- A. Parashar, S. Gurumurthi, and A. Sivasubramaniam, "A Complexity-Effective Approach to ALU Bandwidth Enhancement for Instruction-Level Temporal Redundancy," in 31st Annual International Symposium on Computer Architecture (ISCA) , pp. 376-386, June 2004. Google Scholar
- A. Parashar, S. Gurumurthi, and A. Sivasubramaniam, "SlicK: Slice-Based Locality Exploitation for Efficient Redundant Multithreading," in 12th Annual International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) , pp. 95-105, October 2006. Google Scholar
- S. K. Reinhardt and S. S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading," in 27th Annual International Symposium on Computer Architecture (ISCA) , pp. 25-36, Vancouver, British Columbia, Canada, USA, June 2000. Google Scholar
- E. Rotenberg, "AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors," in 29th Annual Fault-Tolerant Computing Systems (FTCS) , p. 84, Madison, Wisconsin, USA, June 1999. Google Scholar
- D. P. Sieiorek and R. S. Swarz, Reliable Computer Systems: Design and Evaluation , A. K. Peters, 1998. Google Scholar
- T. J. Slegel, R. M. Averill III, M. A. Check, B. C. Giamei, B. W. Krumm, C. A. Krygowski, W. H. Li, J. S. Liptay, J. D. MacDougall, T. J. McPherson, J. A. Navarro, E. M. Schwarz, K. Shum, and C. F. Webb, "IBM's S/390 G5 Microprocessor Design," IEEE Micro , pp. 12-23, March/April 1999. Google ScholarDigital Library
- T. J. Slegel, E. Pfeffer, and J. A. Magee, "The IBM eServer z990 Microprocessor," IBM Journal of Research and Development , Vol. 48 No. 3/4, pp. 295-309, May/July 2004. Google ScholarDigital Library
- J. E. Smith and A. R. Pleszkun, "Implementing Precise Interrupts in Pipelined Processors," IEEE Transactions on Computers , Vol. 37, No. 5, pp. 562-573, May 1988. Google ScholarDigital Library
- A. Sodani and G. S. Sohi, "Dynamic Instruction Reuse," in 24th Annual International Symposium on Computer Architecture (ISCA) , pp. 194-205, Denver, Colorado, USA, June 1997. Google Scholar
- D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm, "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor," in 23rd Annual International Symposium on Computer Architecture (ISCA) , pp. 191-202, May 1999. Google Scholar
- D. M. Tullsen, S. J. Eggers, and H. M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," in 22nd Annual International Symposium on Computer Architecture (ISCA) , pp. 392-403, Italy, June 1995. Google Scholar
- J. Somers, "Stratus ftServer--Intel Fault Tolerant Platform," Intel Developer Forum, Fall 2002.Google Scholar
- L. Spainhower and T. A. Gregg, "IBM S/390 Parallel Enterprise Server G5 Fault Tolerance: A Historical Perspective," IBM Journal of Research and Development , Vol. 43, No. 5/6, pp. 863-873, September/November 1999. Google ScholarDigital Library
- T. N. Vijaykumar, I. Pomeranz, and K. Cheng, "Transient Fault Recovery using Simultaneous Multithreading," in Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA) , May 2002. Google Scholar
- C. Webb, "z6--The Next-Generation Mainframe Microprocessor," Hot Chips , August 2007.Google Scholar
- A. Wood, R. Jardine, and W. Bartlett, "Data Integrity in HP NonStop Servers," in 2nd IEEE Workshop on Silicon Errors in Logic and System Effects (SELSE) , Urbana-Champaign, April 2006.Google Scholar
- H. Ando, Y. Yoshida, A. Inoue, I. Sugiyama, T. Asakawa, K. Morita, T. Muta, T. Motokurumada, S. Okada, H. Yamashita, Y. Satsukawa, A. Konmoto, R. Yamashita, and H. Sugiyama "A 1.3 GHz Fifth Generation SPARC Microprocessor," in 2003 IEEE Solid State Circuits Conference (ISSCC) , pp. 1896-1905, 2003. Google Scholar
- J. Barlett, W. Bartlett, R. Carr, D. Garcia, J. Gray, R. Horst, R. Jardine, D. Lenoski, and D. Mcguire "Fault Tolerance in Tandem Computer Systems," Technical Report 90.5, Part Number 40666, Hewlett-Packard, May 1990.Google Scholar
- W. Bartlett and L. Spainhower, "Commercial Fault Tolerance: A Tale of Two Systems," IEEE Transactions on Dependable and Secure Computing , Vol. 1, No. 1, pp. 87-96, January-March 2004. Google ScholarDigital Library
- D. Bernick, B. Bruckert, P. D. Vigna, D. Garcia, R. Jardine, J. Klecka, and J. Smullen, "NonStop® Advanced Architecture," in Proceedings of the International Conference on Dependable Systems and Networks (DSN) , pp. 12-21, 2005. Google Scholar
- B. Bloom, "Space/Time Trade-offs in Hash Coding with Allowable Errors," Communications of the ACM , Vol. 13, No. 7, pp. 422-426, July 1970. Google ScholarDigital Library
- D. Burger and T. M. Austin, "The Simplescalar Tool Set, Version 2.0," Technical Report 1342, Computer Sciences Department, University of Wisconsin-Madison, June 1997.Google ScholarDigital Library
- M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson, "A Survey of Rollback-Recovery Protocols in Message-Passing Systems," Technical Report CMU-CS-99-148, School of Computer Science, Carnegie Mellon University, June 1999.Google Scholar
- M. A. Gomaa and T. N. Vijaykumar, "Opportunistic Fault Detection," in 32nd Annual International Symposium on Computer Architecture , pp. 172-183, 2005. Google Scholar
- M. A. Gomaa, C. Scarbrough, T. N. Vijaykumar, and I. Pomeranz, "Transient Fault-Recovery for Chip Multiprocessors," in 30th Annual International Symposium on Computer Architecture , pp. 96-109, June 2003. Google Scholar
- P. A. Green Jr., "Observations From 16 Years at a Fault-Tolerant Computer Company," in 15th Symposium on Reliable Distributed Systems , pp. 162-164, 1996. Google Scholar
- S. Hangal and M. Lam, "Tracking Down Software Bugs Using Automatic Anomaly Detection," in International Conference on Software Engineering , ICSE'02, pp. 291-301, May 2002. Google ScholarCross Ref
- S. S. Mukherjee, S. K. Reinhardt, and J. S. Emer, "Incremental Checkpointing in a Multi-Threaded Architecture," United States Patent Application, Filed August 29, 2003.Google Scholar
- J. Nakamo, P. Montesinos, K. Gharachorloo, and J. Torrellas, "ReVive I/O: Efficient Handling of I/O in Highly-Available Rollback-Recovery Servers," in 12th Annual International Symposium on High-Performance Computer Architecture (HPCA) , pp. 200-211, 2006.Google Scholar
- E. Perelman, G. Hamerly, M. Van Biesbrouck, T. Sherwood, and B. Calder, "Using SimPoint for Accurate and Efficient Simulation," in ACM SIGMETRICS, the International Conference on Measurement and Modeling of Computer Systems , pp. 318-319, June 2003. Google Scholar
- M. Prvulovic, Z. Zhang, and J. Torrellas, "ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors," in 29th Annual International Symposium on Computer Architecture (ISCA) , pp. 111-122, 2002. Google Scholar
- P. Racunas, K. Constantinides, S. Manne, and S. S. Mukherjee, "Perturbation-Based Fault Screening," in 13th Annual International Symposium on High-Performance Computer Architecture (HPCA), pp. 169-180, February 2007.
- S. K. Reinhardt, S. S. Mukherjee, and J. S. Emer, "Periodic Checkpointing in a Redundantly Multi-Threaded Architecture," United States Patent Application, Filed August 29, 2003.
- J. E. Smith and A. R. Pleszkun, "Implementation of Precise Interrupts in Pipelined Processors," in 12th International Symposium on Computer Architecture, pp. 291-299, 1985.
- J. C. Smolens, B. T. Gold, J. Kim, B. Falsafi, J. C. Hoe, and A. G. Nowatzyk, "Fingerprinting: Bounding Soft-Error Detection Latency and Bandwidth," IEEE Micro, Vol. 24, No. 6, pp. 22-29, November 2004.
- D. J. Sorin, M. M. K. Martin, M. D. Hill, and D. A. Wood, "SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery," in International Symposium on Computer Architecture (ISCA), pp. 123-134, May 2002.
- T. N. Vijaykumar, I. Pomeranz, and K. Cheng, "Transient Fault Recovery Using Simultaneous Multithreading," in 29th Annual International Symposium on Computer Architecture, pp. 87-98, May 2002.
- N. J. Wang and S. J. Patel, "ReStore: Symptom-Based Soft Error Detection in Microprocessors," IEEE Transactions on Dependable and Secure Computing, Vol. 3, No. 3, pp. 188-201, July-September 2006.
- C. Weaver, J. Emer, S. S. Mukherjee, and S. K. Reinhardt, "Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor," in 31st Annual International Symposium on Computer Architecture (ISCA), pp. 264-275, 2004.
- T. C. Bressoud and F. B. Schneider, "Hypervisor-Based Fault Tolerance," ACM Transactions on Computer Systems, Vol. 14, No. 1, pp. 80-107, February 1996.
- G. Bronevetsky, D. Marques, K. Pingali, P. Szwed, and M. Schulz, "Application-Level Checkpointing for Shared Memory Programs," in 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 235-247, October 2004.
- J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann Publishers, 1993.
- C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, "Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation," in ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 190-200, June 2005.
- A. Mahmood and E. J. McCluskey, "Concurrent Error Detection Using Watchdog Processors - A Survey," IEEE Transactions on Computers, Vol. 37, No. 2, pp. 160-174, February 1988.
- Y. Masubuchi, S. Hoshina, T. Shimada, H. Hirayama, and N. Kato, "Fault Recovery Mechanism for Multiprocessor Servers," in 27th International Symposium on Fault-Tolerant Computing, pp. 184-193, 1997.
- J. Nakano, P. Montesinos, K. Gharachorloo, and J. Torrellas, "ReVive I/O: Efficient Handling of I/O in Highly-Available Rollback-Recovery Servers," in 12th International Symposium on High-Performance Computer Architecture (HPCA), pp. 200-211, 2006.
- N. Nakka, Z. Kalbarczyk, R. K. Iyer, and J. Xu, "An Architectural Framework for Providing Reliability and Security Support," in International Conference on Dependable Systems and Networks (DSN), pp. 585-594, 2004.
- N. Oh, P. P. Shirvani, and E. J. McCluskey, "Error Detection by Duplicated Instructions in Super-Scalar Processors," IEEE Transactions on Reliability, Vol. 51, No. 1, pp. 63-75, March 2002.
- G. A. Reis, J. Chang, and D. I. August, "Automatic Instruction-Level Software-Only Recovery," IEEE Micro, Vol. 27, No. 1, pp. 36-47, January 2007.
- G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August, "SWIFT: Software Implemented Fault Tolerance," in 3rd International Symposium on Code Generation and Optimization (CGO), pp. 243-254, March 2005.
- G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, D. I. August, and S. S. Mukherjee, "Design and Evaluation of Hybrid Fault-Detection Systems," in 32nd International Symposium on Computer Architecture (ISCA), pp. 148-159, June 2005.
- G. A. Reis, J. Chang, D. I. August, R. Cohn, and S. S. Mukherjee, "Configurable Transient Fault Detection via Dynamic Binary Translation," in 2nd Workshop on Architectural Reliability (WAR), December 2006.
- M. A. Schuette and J. P. Shen, "Processor Control Flow Monitoring Using Signatured Instruction Streams," IEEE Transactions on Computers, Vol. C-36, No. 3, pp. 264-276, March 1987.
- G. Tremblay, P. Leveille, J. McCollum, M. J. Pratt, and T. Bissett, "Fault Resilient/Fault Tolerant Computing," European Patent Application Number 04254117.7, Filed July 9, 2004.
- K. R. Walcott, G. Humphreys, and S. Gurumurthi, "Dynamic Prediction of Architectural Vulnerability from Microarchitectural State," in International Symposium on Computer Architecture (ISCA), pp. 516-527, San Diego, California, June 2007.
Cited By
- Venkatesha S and Parthasarathi R (2024). Survey on Redundancy Based-Fault tolerance methods for Processors and Hardware accelerators - Trends in Quantum Computing, Heterogeneous Systems and Reliability, ACM Computing Surveys, 56:11, (1-76), Online publication date: 30-Nov-2024.
- Netti A, Peng Y, Omland P, Paulitsch M, Parra J, Espinosa G, Agarwal U, Chan A and Pattabiraman K (2023). Mixed precision support in HPC applications, Journal of Parallel and Distributed Computing, 181:C, Online publication date: 1-Nov-2023.
- Jia J, Liu Y, Zhang G, Gao Y and Qian D (2022). Software approaches for resilience of high performance computing systems: a survey, Frontiers of Computer Science: Selected Publications from Chinese Universities, 17:4, Online publication date: 1-Aug-2023.
- Topçu B and Öz I (2023). Soft error vulnerability prediction of GPGPU applications, The Journal of Supercomputing, 79:6, (6965-6990), Online publication date: 1-Apr-2023.
- Manzhosov E, Hastings A, Pancholi M, Piersma R, Ziad M and Sethumadhavan S Revisiting Residue Codes for Modern Memories Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture, (73-90)
- Al-haj Ahmad H and Sedaghat Y (2022). CAFI, Microprocessors & Microsystems, 94:C, Online publication date: 1-Oct-2022.
- Fischer M, Riedel O and Lechler A Comprehensive Analysis of Software-Based Fault Tolerance with Arithmetic Coding for Performant Encoding of Integer Calculations Computer Safety, Reliability, and Security, (144-157)
- Öz I and Karadaş Ö (2022). Regional soft error vulnerability and error propagation analysis for GPGPU applications, The Journal of Supercomputing, 78:3, (4095-4130), Online publication date: 1-Feb-2022.
- Arslan S and Unsal O (2021). Efficient selective replication of critical code regions for SDC mitigation leveraging redundant multithreading, The Journal of Supercomputing, 77:12, (14130-14160), Online publication date: 1-Dec-2021.
- Papadimitriou G and Gizopoulos D Demystifying the system vulnerability stack Proceedings of the 48th Annual International Symposium on Computer Architecture, (902-915)
- Öz I and Arslan S (2021). Predicting the Soft Error Vulnerability of Parallel Applications Using Machine Learning, International Journal of Parallel Programming, 49:3, (410-439), Online publication date: 1-Jun-2021.
- Öz I and Arslan S (2019). A Survey on Multithreading Alternatives for Soft Error Fault Tolerance, ACM Computing Surveys, 52:2, (1-38), Online publication date: 31-Mar-2020.
- Sotiropolos P and Vassilakis C (2022). Detection of intermittent faults in software programs through identification of suspicious shared variable access patterns, Journal of Systems and Software, 159:C, Online publication date: 1-Jan-2020.
- Chen J, Li H, Li S, Liang X, Wu P, Tao D, Ouyang K, Liu Y, Zhao K, Guan Q and Chen Z Fault tolerant one-sided matrix decompositions on heterogeneous systems with GPUs Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, (1-12)
- Ozer E, Venu B, Iturbe X, Das S, Lyberis S, Biggs J, Harrod P and Penton J Error correlation prediction in lockstep processors for safety-critical systems Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, (737-748)
- Yan Z, Jiang H, Srisa-an W, Seth S and Tan Y Leverage Redundancy in Hardware Transactional Memory to Improve Cache Reliability Proceedings of the 47th International Conference on Parallel Processing, (1-10)
- Rosa F, Bandeira V, Reis R and Ost L Extensive evaluation of programming models and ISAs impact on multicore soft error reliability Proceedings of the 55th Annual Design Automation Conference, (1-6)
- da Rosa F, Bandeira V, Reis R and Ost L Extensive Evaluation of Programming Models and ISAs Impact on Multicore Soft Error Reliability 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), (1-6)
- Chennakesavulu M, Jayachandra Prasad T and Sumalatha V (2018). Improved Performance of Error Controlling Codes Using Pass Transistor Logic, Circuits, Systems, and Signal Processing, 37:3, (1145-1161), Online publication date: 1-Mar-2018.
- Li T, Ambrose J, Ragel R and Parameswaran S (2016). Processor Design for Soft Errors, ACM Computing Surveys, 49:3, (1-44), Online publication date: 30-Sep-2017.
- Cho H, Cheng E, Shepherd T, Cher C and Mitra S (2017). System-Level Effects of Soft Errors in Uncore Components, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 36:9, (1497-1510), Online publication date: 1-Sep-2017.
- Moradian H, Lee J and Yu J (2017). Efficient Low-Cost Fault-Localization and Self-Repairing Radix-2 Signed-Digit Adders Applying the Self-Dual Concept, Journal of Signal Processing Systems, 88:3, (297-309), Online publication date: 1-Sep-2017.
- Didehban M and Shrivastava A nZDC Proceedings of the 53rd Annual Design Automation Conference, (1-6)
- Ebrahimi M, Moshrefpour M, Golanbari M and Tahoori M Fault injection acceleration by simultaneous injection of non-interacting faults Proceedings of the 53rd Annual Design Automation Conference, (1-6)
- Wu P, Guan Q, DeBardeleben N, Blanchard S, Tao D, Liang X, Chen J and Chen Z Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, (31-42)
- Riera M, Canal R, Abella J and Gonzalez A A detailed methodology to compute soft error rates in advanced technologies Proceedings of the 2016 Conference on Design, Automation & Test in Europe, (217-222)
- Chen L, Ebrahimi M and Tahoori M (2016). Reliability-Aware Resource Allocation and Binding in High-Level Synthesis, ACM Transactions on Design Automation of Electronic Systems, 21:2, (1-27), Online publication date: 28-Jan-2016.
- Jing N, Zhou J, Jiang J, Chen X, He W and Mao Z Redundancy based Interconnect Duplication to Mitigate Soft Errors in SRAM-based FPGAs Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, (764-769)
- Bustamante L and Al-Asaad H Detection of soft errors through checksums in redundant execution systems 2015 IEEE AUTOTESTCON, (134-137)
- Cho H, Cher C, Shepherd T and Mitra S Understanding soft errors in uncore components Proceedings of the 52nd Annual Design Automation Conference, (1-6)
- Yetim Y, Malik S and Martonosi M (2015). CommGuard, ACM SIGARCH Computer Architecture News, 43:1, (311-323), Online publication date: 29-May-2015.
- Yetim Y, Malik S and Martonosi M (2015). CommGuard, ACM SIGPLAN Notices, 50:4, (311-323), Online publication date: 12-May-2015.
- Rodopoulos D, Psychou G, Sabry M, Catthoor F, Papanikolaou A, Soudris D, Noll T and Atienza D (2015). Classification Framework for Analysis and Modeling of Physically Induced Reliability Violations, ACM Computing Surveys, 47:3, (1-33), Online publication date: 16-Apr-2015.
- Yetim Y, Malik S and Martonosi M CommGuard Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, (311-323)
- Rodopoulos D, Papanikolaou A, Catthoor F and Soudris D (2015). Demonstrating HW–SW Transient Error Mitigation on the Single-Chip Cloud Computer Data Plane, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 23:3, (507-519), Online publication date: 1-Mar-2015.
- Wadden J, Lyashevsky A, Gurumurthi S, Sridharan V and Skadron K (2014). Real-world design and evaluation of compiler-managed GPU redundant multithreading, ACM SIGARCH Computer Architecture News, 42:3, (73-84), Online publication date: 16-Oct-2014.
- Upasani G, Vera X and González A (2014). Avoiding core's DUE & SDC via acoustic wave detectors and tailored error containment and recovery, ACM SIGARCH Computer Architecture News, 42:3, (37-48), Online publication date: 16-Oct-2014.
- Döbel B and Härtig H Can we put concurrency back into redundant multithreading? Proceedings of the 14th International Conference on Embedded Software, (1-10)
- Schirmeier H, Borchert C and Spinczyk O Rapid Fault-Space Exploration by Evolutionary Pruning Proceedings of the 33rd International Conference on Computer Safety, Reliability, and Security - Volume 8666, (17-32)
- Wadden J, Lyashevsky A, Gurumurthi S, Sridharan V and Skadron K Real-world design and evaluation of compiler-managed GPU redundant multithreading Proceedings of the 41st annual international symposium on Computer architecture, (73-84)
- Upasani G, Vera X and González A Avoiding core's DUE & SDC via acoustic wave detectors and tailored error containment and recovery Proceedings of the 41st annual international symposium on Computer architecture, (37-48)
- Shrivastava A, Rhisheekesan A, Jeyapaul R and Wu C Quantitative Analysis of Control Flow Checking Mechanisms for Soft Errors Proceedings of the 51st Annual Design Automation Conference, (1-6)
- Zhang H, Kochte M, Imhof M, Bauer L, Wunderlich H and Henkel J GUARD Proceedings of the 51st Annual Design Automation Conference, (1-6)
- Liu B and Wang B Embedded reconfigurable logic for ASIC design obfuscation against supply chain attacks Proceedings of the conference on Design, Automation & Test in Europe, (1-6)
- Caplan J, Mera M, Milder P and Meyer B Trade-offs in execution signature compression for reliable processor systems Proceedings of the conference on Design, Automation & Test in Europe, (1-6)
- Amin M, Shakir M, Javed A, Hassan M and Raza S (2014). Low-Cost fault tolerant methodology for real time MPSoC based embedded system, International Journal of Reconfigurable Computing, 2014, (13-13), Online publication date: 1-Jan-2014.
- Sun G, Kursun E, Rivers J and Xie Y (2013). Exploring the vulnerability of CMPs to soft errors with 3D stacked nonvolatile memory, ACM Journal on Emerging Technologies in Computing Systems, 9:3, (1-22), Online publication date: 1-Sep-2013.
- Khudia D and Mahlke S Low cost control flow protection using abstract control signatures Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems, (3-12)
- Khudia D and Mahlke S (2013). Low cost control flow protection using abstract control signatures, ACM SIGPLAN Notices, 48:5, (3-12), Online publication date: 23-May-2013.
- Lee J, Ko Y, Lee K, Youn J and Paek Y (2013). Dynamic code duplication with vulnerability awareness for soft error detection on VLIW architectures, ACM Transactions on Architecture and Code Optimization, 9:4, (1-24), Online publication date: 1-Jan-2013.
- Costello F Noisy reasoners Proceedings of the 5th international conference on Artificial General Intelligence, (31-40)
- Döbel B, Härtig H and Engel M Operating system support for redundant multithreading Proceedings of the tenth ACM international conference on Embedded software, (83-92)
- Upasani G, Vera X and González A (2012). Setting an error detection infrastructure with low cost acoustic wave detectors, ACM SIGARCH Computer Architecture News, 40:3, (333-343), Online publication date: 5-Sep-2012.
- Shayan M, Singh V, Singh A and Fujita M SEU tolerant robust latch design Proceedings of the 16th international conference on Progress in VLSI Design and Test, (223-232)
- Sardashti S and Wood D UniFI Proceedings of the 26th ACM international conference on Supercomputing, (59-68)
- Upasani G, Vera X and González A Setting an error detection infrastructure with low cost acoustic wave detectors Proceedings of the 39th Annual International Symposium on Computer Architecture, (333-343)
- Pan Z and Breuer M (2012). Error Rate Estimation for Defective Circuits via Ones Counting, ACM Transactions on Design Automation of Electronic Systems, 17:1, (1-14), Online publication date: 1-Jan-2012.
- Meyer B, Calhoun B, Lach J and Skadron K Cost-effective safety and fault localization using distributed temporal redundancy Proceedings of the 14th international conference on Compilers, architectures and synthesis for embedded systems, (125-134)
- Agarwal R, Garg P and Torrellas J (2011). Rebound, ACM SIGARCH Computer Architecture News, 39:3, (153-164), Online publication date: 22-Jun-2011.
- Agarwal R, Garg P and Torrellas J Rebound Proceedings of the 38th annual international symposium on Computer architecture, (153-164)
- Jose M, Hu Y and Majumdar R On power and fault-tolerance optimization in FPGA physical synthesis Proceedings of the International Conference on Computer-Aided Design, (224-229)
- Lee J, Feng Z and He L In-place decomposition for robustness in FPGA Proceedings of the International Conference on Computer-Aided Design, (143-148)
- Calimera A, Loghi M, Macii E and Poncino M Dynamic indexing Proceedings of the 16th ACM/IEEE international symposium on Low power electronics and design, (343-348)
- Thompto B and Hoppe B Verification for fault tolerance of the IBM system z microprocessor Proceedings of the 47th Design Automation Conference, (525-530)
- Jose M, Hu Y, Majumdar R and He L Rewiring for robustness Proceedings of the 47th Design Automation Conference, (469-474)
- Izydorczyk J (2010). Three steps to the thermal noise death of Moore's law, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 18:1, (161-165), Online publication date: 1-Jan-2010.
- Sánchez D, Aragón J and García J REPAS Proceedings of the 15th International Euro-Par Conference on Parallel Processing, (321-333)
- Väyrynen M, Singh V and Larsson E Fault-tolerant average execution time optimization for general-purpose multi-processor system-on-chips Proceedings of the Conference on Design, Automation and Test in Europe, (484-489)
- Florio V and Blondia C (2008). On the requirements of new software development, International Journal of Business Intelligence and Data Mining, 3:3, (330-349), Online publication date: 1-Dec-2008.
- Jing N, Zhou J, Jiang J, Chen X, He W and Mao Z Redundancy based interconnect duplication to mitigate soft errors in SRAM-based FPGAs 2015 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), (764-769)