Architecture Design for Soft Errors
February 2008
Publisher:
  • Morgan Kaufmann Publishers Inc.
  • 340 Pine Street, Sixth Floor
  • San Francisco
  • CA
  • United States
ISBN: 978-0-08-055832-5
Published: 22 February 2008
Pages: 360
Abstract

This book provides a comprehensive description of the architectural techniques to tackle the soft error problem. It covers the new methodologies for quantitative analysis of soft errors as well as novel, cost-effective architectural techniques to mitigate them. To provide readers with a better grasp of the broader problem definition and solution space, this book also delves into the physics of soft errors and reviews current circuit and software mitigation techniques. It provides the methodologies necessary to quantify the effect of radiation-induced soft errors as well as state-of-the-art techniques to protect against them.

Table of Contents
  • Chapter 1: Introduction
  • Chapter 2: Device- and Circuit-Level Modeling, Measurement, and Mitigation
  • Chapter 3: Architectural Vulnerability Analysis
  • Chapter 4: Advanced Architectural Vulnerability Analysis
  • Chapter 5: Error Coding Techniques
  • Chapter 6: Fault Detection via Redundant Execution
  • Chapter 7: Hardware Error Recovery
  • Chapter 8: Software Detection and Recovery

References

  1. M. Agostinelli, J. Hicks, J. Xu, B. Woolery, K. Mistry, K. Zhang, S. Jacobs, J. Jopling, W. Yang, B. Lee, T. Raz, M. Mehalel, P. Kolar, Y. Wang, J. Sandford, D. Pivin, C. Peterson, M. DiBattista, S. Pae, M. Jones, S. Johnson, and G. Subramanian, "Erratic Fluctuations of SRAM Cache Vmin at the 90 nm Process Technology Node," in IEEE International Electron Devices Meeting (IEDM) , pp. 655-658, December 2005.Google ScholarGoogle Scholar
  2. H. Ando, Y. Yoshida, A. Inoue, I. Sugiyama, T. Asakawa, K. Morita, T. Muta, T. Motokurumada, S. Okada, H. Yamashita, Y. Satsukawa, A. Konmoto, R. Yamashita, and H. Sugiyama, "A 13 GHz Fifth Generation SPARC64 Microprocessor," in IEEE Journal of Solid State Circuits , Volume 38, Issue 11, pp. 1896-1905, November 2003.Google ScholarGoogle ScholarCross RefCross Ref
  3. R. Baumann, "Tutorial on Soft Errors," in International Reliability Physics Symposium (IRPS) Tutorial Notes , IEEE, Dallas, Texas, USA, April 2002.Google ScholarGoogle Scholar
  4. R. Baumann, T. Hossain, E. Smith, S. Murata, and H. Kitagawa, "Boron as a Primary Source of Radiation in High Density DRAMs," in IEEE Symposium on VLSI , pp. 81-82, June 1995.Google ScholarGoogle Scholar
  5. S. Borkar, "Designing Reliable Systems fromUnreliable Components: The Challenges of Transistor Variability and Degradation," IEEE Micro , Volume 25, Issue 6, pp. 10-16, November/December 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. D. Bossen, "CMOS Soft Errors and Server Design," in International Reliability Physics Symposium (IRPS) Tutorial Notes , IEEE, Dallas, Texas, USA, April 2002.Google ScholarGoogle Scholar
  7. M. W. Friedlander, A Thin Cosmic Rain: Particles from Outer Space , Harvard University Press, November 2002.Google ScholarGoogle Scholar
  8. S. Hareland, J. Maiz, M. Alavi, K. Mistry, S. Walstra, and C. Dai, "Impact of CMOS Process Scaling and SOI on the Soft Error Rates of Logic Processes," in Symposium on VLSI Technology Digest of Technical Papers , pp. 73-74, June 2001.Google ScholarGoogle Scholar
  9. M. S. Gordon, et al., "Measurement of the Flux and Energy Spectrum of Cosmic-Ray Induced Neutrons on the Ground," IEEE Transactions on Nuclear Science , Vol. 51, No. 6, Part 2, pp. 3427-3434, December 2004.Google ScholarGoogle ScholarCross RefCross Ref
  10. B. R. Havekort, et al., Performability Modelling: Techniques and Tools , John Wiley and Sons, 2001.Google ScholarGoogle Scholar
  11. P. Hazucha and C. Svensson, "Impact of CMOS Technological Scaling on the Atmospheric Neutron Soft Error Rate," IEEE Transactions on Nuclear Science , Vol. 47, No. 6, pp. 2586-2594, December 2000.Google ScholarGoogle ScholarCross RefCross Ref
  12. T. Karnik, P. Hazucha, and J. Patel, "Characterization of Soft Errors Caused by Single Event Upsets in CMOS Processes," IEEE Transactions on Dependable and Secure Computing , Vol. 1, No. 2, pp. 128-143, April-June 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. JEDEC Standard, "Measurement and Reporting of Alpha Particles and Terrestrial Cosmic Ray-Induced Soft Errors in Semiconductor Devices," JESD89 , August 2001.Google ScholarGoogle Scholar
  14. J. Maiz, S. Hareland, K. Zhang, and P. Armstrong, "Characterization of Multi-Bit Soft Error Events in Advanced SRAMs," Digest of International Electronic Device Meeting (IEDM) , pp. 21.4.1-21.4.4, December 2003.Google ScholarGoogle Scholar
  15. T. C. May and M. H. Woods, "Alpha-Particle-Induced Soft Errors in Dynamic Memories," IEEE Transactions on Electronic Devices , Vol. 26, Issue 1, pp. 2-9, January 1979.Google ScholarGoogle ScholarCross RefCross Ref
  16. S. E. Michalak, K. W. Harris, N. W. Hengartner, B. E. Takala, and S. A. Wender, "Predicting the Number of Fatal Soft Errors in Los Alamos National Laboratory's ASC Q Supercomputer," IEEE Transactions on Device and Materials Reliability , Vol. 5, No. 3, pp. 329-335, September 2005.Google ScholarGoogle ScholarCross RefCross Ref
  17. E. Normand, "Single Event Upset at Ground Level," IEEE Transactions on Nuclear Science , Vol. 43, No. 6, pp. 2742-2750, December 1996.Google ScholarGoogle Scholar
  18. D. K. Pradhan, Fault-Tolerant Computer System Design , Prentice-Hall, 2003. Google ScholarGoogle Scholar
  19. G. Reis, J. Chang, N. Vachharajani, R. Rangan, D. August, and S. S. Mukherjee, "Design and Evaluation of Hybrid Fault-Detection Systems," in International Symposium on Computer Architecture (ISCA) , pp. 148-159, Madison, Wisconsin, USA, June 2005. Google ScholarGoogle Scholar
  20. N. Seifert, et al., "Radiation-Induced Soft Error Rates of Advanced CMOS Bulk Devices," in 44th Annual International Reliability Physics Symposium (IRPS) , pp. 217-225, 2006.Google ScholarGoogle Scholar
  21. G. R. Srinivasan, "Modeling the Cosmic-Ray-Induced Soft-Error Rate in Integrated Circuits: An Overview," IBM Journal of Research and Development , Vol. 40, No. 1, pp. 77-89, January 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. J. H. Strathis, "Reliability Limits for the Gate Insulator in CMOS Technology," IBM Journal of Research and Development , Vol. 46, No. 2/3, pp. 265-286, March/May 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. J. Segura and C. F. Hawkins, CMOS Electronics: How ItWorks, How It Fails , Wiley-IEEE Press, 2004. Google ScholarGoogle Scholar
  24. H. H. K. Tang, "Nuclear Physics of Cosmic Ray Interaction with Semiconductor Materials: Particle-Induced Soft Errors from a Physicist's Perspective," IBM Journal of Research and Development , Vol. 40, No. 1, pp. 91-108, January 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Y. Tosaka, S. Satoh, K. Suzuki, T. Suguii, H. Ehara, G. A. Woffinden, and S. A. Wender, "Impact of Cosmic Ray Neutron Induced Soft Errors, on Advanced Submicron CMOS Circuits," in VLSI Symposium on VLSI Technology Digest of Technical Papers , pp. 148-149, June 1996.Google ScholarGoogle Scholar
  26. C. Weaver, J. Emer, S. S. Mukherjee, and S. K. Reinhardt, "Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor," in 31st Annual International Symposium on Computer Architecture , pp. 264-275, June 2004. Google ScholarGoogle Scholar
  27. A. P. Wood, "Software Reliability from the Customer View," IEEE Computer , Vol. 36, No. 8, pp. 37-42, August 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. J. F. Ziegler, "Terrestrial Cosmic Rays," IBM Journal of Research and Development , Vol. 40, No. 1, pp. 19-39, January 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. J. F. Ziegler and W. A. Lanford, "The Effect of Cosmic Rays on Computer Memories," Science , Vol. 206, No. 776, 1979.Google ScholarGoogle Scholar
  30. J. F. Zielger and H. Puchner, SER-- History, Trends and Challenges , Cypress Semiconductor Corporation, 2004.Google ScholarGoogle Scholar
  31. M. P. Baze and S. P. Buchner, "Attenuation of Single Event Induced Pulses in CMOS Combinational Logic," IEEE Transactions on Nuclear Science , Vol. 44, No. 6, pp. 2217-2223, December 1997.Google ScholarGoogle ScholarCross RefCross Ref
  32. M. J. Bellido-Diaz, J. Juan-Chico, A. J. Acosta, M. Valencia, and J. L. Heurtas, "Logical Modeling of Delay Degradation Effect in Static CMOS Gates," IEE Proceedings Circuits, Devices, and Systems , Vol. 147, No. 2, pp. 107-117, April 2000.Google ScholarGoogle ScholarCross RefCross Ref
  33. T. Calin, M. Nicolaidis, and R. Velazco, "Upset Hardened Memory Design for Submicron CMOS Technology," IEEE Transactions on Nuclear Science , Vol. 43, No. 6, pp. 2874-2878, December 1996.Google ScholarGoogle Scholar
  34. E. H. Cannon, D. D. Reinhardt, M. S. Gordon, and P. S. Makowenskyj, "SRAM SER in 90, 130 and 180 nm Bulk and SOI Technologies," in Reliability Physics Symposium Proceedings, 2004. 42nd Annual. 2004 IEEE International , pp. 300-304, 25-29 April 2004.Google ScholarGoogle Scholar
  35. C. Constantinescu, "Neutron SER Characterization of Microprocessors," in International Conference on Dependable Systems and Networks (DSN) , pp. 754-759, July 2005. Google ScholarGoogle Scholar
  36. L. B. Freeman, "Critical Charge Calculations for a Bipolar SRMA Array," IBM Journal of Research and Development , Vol. 40, No. 1, pp. 119-129, January 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. B. S. Gill, C. Papachristou, F. G. Wolff, and N. Seifert, "Node Sensitivity Analysis for Soft Errors in CMOS Logic," in International Test Conference , paper 37.2, pp. 1-9, November 2005.Google ScholarGoogle Scholar
  38. P. Hazucha, T. Karnik, J. Maiz, S. Walstra, B. Bloechel, J. Tschanz, G. Dermer, S. Hareland, P. Armstrong, and S. Borkar, "Neutron Soft Error Rate Measurements in a 90-nm CMOS Process and Scaling Trends in SRAM from 0.25-µm to 90-nm Generation," in IEDM '03 Technical Digest, IEEE International , pp. 21.5.1-21.5.4, 8-10 December, 2003.Google ScholarGoogle Scholar
  39. P. Hazucha, T. Karnik, S. Walstra, B. A. Bloechel, J. W. Tschanz, J. Maiz, K. Soumyanath, G. E. Dermer, S. Narenda, V. De, and S. Borkar, "Measurements and Analysis of SER-Tolerant Latch in a 90-nm Dual-V T CMOS Process," IEEE Journal of Solid-State Circuits , Vol. 39, No. 9, pp. 617-620, September 2004.Google ScholarGoogle ScholarCross RefCross Ref
  40. P. Hazucha and C. Svensson, "Impact of CMOS Technology Scaling on the Atmospheric Neutron Soft Error Rate," IEEE Transactions on Nuclear Science , Vol. 47, No. 6, pp. 2586-2594, December 2000.Google ScholarGoogle ScholarCross RefCross Ref
  41. P. Hazucha, C. Svensson, and S. A. Wender, "Cosmic-Ray Soft Error Rate Characterization of a Standard 0.6-µm CMOS Process," IEEE Journal of Solid-State Circuits , Vol. 35, No. 10, pp. 1422-1429, October 2000.Google ScholarGoogle ScholarCross RefCross Ref
  42. M. A. Horowitz, Timing Models for MOS Circuits , Technical Report SEL83-003, Integrated Circuits Laboratory, Stanford University, 1983. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. JEDEC standard JESD89, Measurement and Reporting of Alpha Particles and Terrestrial Cosmic-Ray-Induced Soft Errors in Semiconductor Devices , August 2001.Google ScholarGoogle Scholar
  44. T. Karnik, P. Hazucha, and J. Patel, "Characterization of Soft Errors Caused by Single Event Upsets in CMOS Processes," IEEE Transactions on Dependable and Secure Computing , Vol. 1, No. 2, pp. 128-143, April-June 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. T. Karnik, S. Vangal, V. Veeramachaneni, P. Hazucha, V. Errguntla, and S. Borkar, "Selective Node Engineering for Chip-Level Soft Error Rate Improvement," in 2002 Symposium on VLSI Circuits Digest of Technical Papers , pp. 204-205, June 2002.Google ScholarGoogle Scholar
  46. P. Liden, P. Dahlgren, R. Johansson, and J. Karlsson, "On Latching Probability of Particle Induced Transient in Combinatorial Networks," in 24th Symposium on Fault-Tolerant Computing (FTCS) , pp. 340-349, June 1994.Google ScholarGoogle Scholar
  47. S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, "Robust System Design with Built-In Soft-Error Resilience," Vol. 38, No. 2, pp. 43-52, IEEE Computer , February 2005. Google ScholarGoogle Scholar
  48. K. Mohanram and N. A. Touba, "Cost-Effective Approach for Reducing Soft Error Failure Rate in Logic Circuits," in International Test Conference , Sep. 30-Oct. 2, 2003.Google ScholarGoogle Scholar
  49. P. C. Murley and G. R. Srinivasan, "Soft-Error Monte Carlo Modeling Program, SEMM," IBM Journal of Research and Development , Vol. 40, No. 1, pp. 109-118, January 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. E. Normand, "Single Event Upset at Ground Level," IEEE Transactions on Nuclear Science , Vol. 43, No. 6, pp. 2742-2750, December 1996.Google ScholarGoogle Scholar
  51. J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits , Prentice Hall, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. L. Rockett, "An SEU Hardened CMOS Data Latch Design," IEEE Transactions on Nuclear Science , Vol. NS-35, No. 6, pp. 1682-1687, December 1988.Google ScholarGoogle ScholarCross RefCross Ref
  53. N. Seifert, P. Shipley, M. D. Pant, V. Ambrose, and B. Gill, "Radiation-Induced Clock Jitter and Race," in International Reliability Physics Symposium , pp. 215-222, April 2005.Google ScholarGoogle Scholar
  54. N. Seifert and N. Tam, "Timing Vulnerability Factors of Sequentials," IEEE Transactions on Device and Materials Reliability , Vol. 3, No. 4, pp. 516-522, September 2004.Google ScholarGoogle ScholarCross RefCross Ref
  55. P. Shivakumar, M. Kistler, S. W. Keckler, D. Burger, and L. Alvisi, "Modeling the Effect of Technology Trends on the Soft Error Rate of Combinatorial Logic," in International Conference on Dependable Systems and Networks , pp. 389-398, June 2002. Google ScholarGoogle Scholar
  56. A. Taber and E. Normand, "Single Event Upset in Avionics," IEEE Transactions on Nuclear Science , Vol. 40, No. 2, pp. 120-126, April 1993.Google ScholarGoogle ScholarCross RefCross Ref
  57. S. Yamamoto, K. Kokuryou, Y. Okada, J. Komori, E. Murakami, K. Kubota, N. Matsuoka, and Y. Nagai, "Neutron-Induced Soft Error in Logic Devices Using Quasi-Monenergetic Neutron Beam," in 42nd Annual International Reliability Physics Symposium , Phoenix, pp. 305-309, April 2004.Google ScholarGoogle Scholar
  58. M. Zhang and N. R. Shanbhag, "ASoft Error Rate Analysis (SERA) Methodology," in International Conference on Computer Aided Design , pp. 111-118, November 2004. Google ScholarGoogle Scholar
  59. J. F. Ziegler andW. A. Lanford, "Effect of Cosmic Rays on Computer Memories," Science , Vol. 206, No. 4420, pp. 776-788, November 1979.Google ScholarGoogle ScholarCross RefCross Ref
  60. J. F. Zielger and H. Puchner, SER--History, Trends and Challenges , Cypress Semiconductor Corporation, 2004.Google ScholarGoogle Scholar
  61. A. Biswas, P. Racunas, J. Emer, and S. S. Mukherjee, "Computing Accurate AVFs using ACE Analysis on Performance Models: A Rebuttal," Computer Architecture Letters (CAL) , December 2007. Google ScholarGoogle Scholar
  62. J. A. Butts and G. Sohi, "Dynamic Dead-Instruction Detection and Elimination," in 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) , pp. 199-210, October 2002. Google ScholarGoogle Scholar
  63. B. Fahs, S. Bose, M. Crum, B. Slechta, F. Spadini, T. Tung, S. J. Patel, and S. S. Lumetta, "Performance Characterization of a Hardware Mechanism for Dynamic Optimization," in 34th Annual International Symposium on Microarchitecture (MICRO) , pp. 16-27, December 2001. Google ScholarGoogle Scholar
  64. Y. Choi, A. Knies, L. Gerke, and T.-F. Ngai, "The Impact of If-Conversion and Branch Prediction on Program Execution on the Intel Itanium Processor," in 34th Annual International Symposium on Microarchitecture (MICRO) , pp. 182-191, December 2001. Google ScholarGoogle Scholar
  65. J. Emer, P. Ahuja, N. Binkert, E. Borch, R. Espasa, T. Juan, A. Klauser, C. K. Luk, S. Manne, S. S. Mukherjee, H. Patil, and S. Wallace, "Asim: A Performance Model Framework," IEEE Computer , Vol. 35, No. 2, pp. 68-76, February 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. J. L. Hennessy and D.A. Patterson, Computer Architecture:AQuantitative Approach , Elsevier Science, 2003. Google ScholarGoogle Scholar
  67. E. D. Lazowska, J. Zahorjan, G. S. Graham, and K. C. Sevcik, Quantitative System Performance , Prentice-Hall, Englewood Cliffs, New Jersey, 1984. Google ScholarGoogle Scholar
  68. X. Li, S. V. Adve, P. Bose, and J. A. Rivers, "Architecture-Level Soft Error Analysis: Examining the Limits of Common Assumptions," in International Conference on Dependable Systems and Networks (DSN) , pp. 266-275, 2007. Google ScholarGoogle Scholar
  69. X. Li, S. V. Adve, P. Bose, and J. A. Rivers, "SoftArch: An Architecture-Level Tool for Modeling and Analyzing Soft Errors," in International Conference on Dependable Systems and Networks (DSN) , pp. 496-505, 2005. Google ScholarGoogle Scholar
  70. S. S. Mukherjee, C. T. Weaver, J. Emer, S. K. Reinhardt, and T. Austin, "A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor," in 36th Annual International Symposium on Microarchitecture (MICRO) , pp. 29-40, December 2003. Google ScholarGoogle Scholar
  71. H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karnunanidhi, "Pinpointing Representative Portions of Large Intel Itanium Programs with Dynamic Instrumentation," in 37th Annual International Symposium on Microarchitecture (MICRO) , pp. 81-92, 2004. Google ScholarGoogle Scholar
  72. T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, "Automatically Characterizing Large Scale Program Behavior," in 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) , pp. 45-57, October 2002. Google ScholarGoogle Scholar
  73. N. Wang, M. Fertig, and S. Patel, "Y-Branches: When You Come to a Fork in the Road, Take It," in 12th International Conference on Parallel Architectures and Compilation Techniques (PACT) , pp. 56-67, 2003. Google ScholarGoogle Scholar
  74. N. Wang, A. Mahesri, and S. J. Patel, "Examining ACE Analysis Reliability Estimates Using Fault-Injection," in 34th International Symposium on Computer Architecture (ISCA) , pp. 460-469, 2007. Google ScholarGoogle Scholar
  75. C. Weaver, J. Emer, S. S. Mukherjee, and S. K. Reinhardt, "Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor," in 31st Annual International Symposium on Computer Architecture , pp. 264-275, June 2004. Google ScholarGoogle Scholar
  76. A. O. Allen, Probability, Statistics, and Queue Theory with Computer Science Applications , Academic Press, 1990. Google ScholarGoogle Scholar
  77. AMD, "BIOS and Kernel Developer's Guide for AMD Athlon¿64 and AMD Opteron¿ Processors." Publication #26094, Revision 3.14, April 2004. Available at: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26094.PDF.Google ScholarGoogle Scholar
  78. A. Biswas, P. Racunas, R. Cheveresan, J. Emer, S. S. Mukherjee, and R. Rangan, "Computing Architectural Vulnerability Factors for Address-Based Structures," in 32nd Annual International Symposium on Computer Architecture (ISCA) , pp. 532-543, June 2005. Google ScholarGoogle Scholar
  79. J. L. Hennessy and D. L. Patterson, Computer Architecture: A Quantitative Approach , Morgan Kaufmann Publishers, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  80. S. Kim and A. K. Somani, "Soft Error Sensitivity Characterization for Microprocessor Dependability Enhancement Strategy," in International Conference on Dependable Systems and Networks (DSN) , pp. 416-425, June 2002. Google ScholarGoogle Scholar
  81. A. Lai, C. Fide, and B. Falsafi. "Dead-Block Prediction and Dead-Block Correlating Prefetchers," in 28th International Symposium on Computer Architecture , pp. 144-154, June 2001. Google ScholarGoogle Scholar
  82. H. T. Nguyen, Y. Yagil, N. Seifert, and M. Reitsma, "Chip-Level Soft Error Estimation Method," IEEE Transactions on Device and Materials Reliability , Vol. 5, No. 3, pp. 365-381, September 2005.Google ScholarGoogle ScholarCross RefCross Ref
  83. N. Wang and S. J. Patel, "ReStore: Symptom-Based Soft Error Detection in Microprocessors," IEEE Transactions on Dependable and Secure Computing , Vol. 3, No. 3, pp. 188-201, July-September 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  84. N. Wang, J. Quek, T. M. Rafacz, and S. J. Patel, "Characterizing the Effects of Transient Faults on a High-Performance Processor Pipeline," in International Conference on Dependable Systems and Networks (DSN) , pp. 61-70, June 2004. Google ScholarGoogle Scholar
  85. D. Wood, M. Hill, and R. Kessler. "A Model for Estimating Trace-Sample Miss Ratios," in 1991 SIGMETRICS Conference on Measurement and Modeling of Computer Systems , pp. 79-89, May 1991. Google ScholarGoogle Scholar
  86. AMD, "BIOS and Kernel Developer's Guide for AMD Athlon¿ 64 and AMD Opteron¿ Processors," Publication #26094, Revision 3.14, April 2004. Available at: http://www.amd.com/ us-en/assets/content_type/white_papers_and_tech_docs/26094.PDF.Google ScholarGoogle Scholar
  87. H. Ando, Y. Yoshida, A. Inoue, I. Sugiyama, T. Asakawa, K. Morita, T. Muta, T. Motokurumada, S. Okada, H. Yamashita, Y. Satsukawa, A. Konmoto, R. Yamashita, and H. Sugiyama, "A1.3 GHz Fifth Generation SPARC64 Microprocessor," in International Solid-State Circuits Conference , pp. 1896-1905, 2003. Google ScholarGoogle Scholar
  88. D. C. Bossen, A. Kitamorn, K. F. Reick, and M. S. Floyd, "Fault-Tolerant Design of the IBM pSeries 690 System Using POWER4 Processor Technology," IBM Journal of Research and Development , Vol. 46, No. 1, pp. 77-86, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  89. O. Ergin, O. Unsal, X. Vera, and A. Gonzalez, "Exploiting Narrow Values for Soft Error Tolerance," IEEE Computer Architecture Letters , Vol. 5, pp. 12-12, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  90. M.Y. Hsiao, "A Class of Optimal Minimum Odd-Weight-Column SEC-DED Codes," IBM Journal of Research and Development , Vol. 14, No. 4, pp. 395-401, 1970. Google ScholarGoogle ScholarDigital LibraryDigital Library
  91. S. Iacobovici, "Residue-Based Error Detection for a Shift Operation," United States Patent Application, filed August 22, 2005.Google ScholarGoogle Scholar
  92. Intel Corporation, Intel® 64 and IA-32 Architectures, Software Developer's Manual, Volume 3A: System Programming Guide, Part 1 . Available at: http://www.intel.com.Google ScholarGoogle Scholar
  93. J.-C. Lo, "Reliable Floating-Point Arithmetic Algorithms for Error-Coded Operands," IEEE Transactions on Computers , Vol. 43, No. 4, pp. 400-412, April 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  94. S. S. Mukherjee, J. Emer, T. Fossum, and S. K. Reinhardt, "Cache Scrubbing in Microprocessors: Myth or Necessity?" in 10th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC) , pp. 37-42, March 3-5, 2004, Papeete, French Polynesia. Google ScholarGoogle Scholar
  95. N. Nakka, J. Xu, Z. Kalbarczyk, and R. K. Iyer, "An Architectural Framework for Providing Reliability and Security Support," Dependable Systems and Networks (DSN) , pp. 585-594, June 2004. Google ScholarGoogle Scholar
  96. I. A. Noufal and M. Nicolaidis, "A CAD Framework for Generating Self-Checking Multipliers Based on Residue Codes," in Design, Automation and Test in Europe Conference and Exhibition , pp. 122-129, 1999. Google ScholarGoogle Scholar
  97. M. Nicolaidis, "Carry Checking/Parity Prediction Adders and ALUs," IEEE Transactions on Very Large Scale Integration (VLSI) , Vol. 11, No. 1, pp. 121-128, February 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  98. M. Nicolaidis and R. O. Duarte, "Fault-Secure Parity Prediction Booth Multipliers," IEEE Design and Test of Computers , Vol. 16, No. 3, pp. 90-101, July-September 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  99. M. Nicolaidis, R. O. Duarte, S. Manich, and J. Figueras, "Fault-Secure Parity Prediction Arithmetic Operators," IEEE Design and Test of Computers , Vol. 14, No. 2, pp. 60-71, April-June 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  100. W. W. Peterson and E. J. Weldon, Jr., Error-Correcting Codes , MIT Press, 1961.Google ScholarGoogle Scholar
  101. D. K. Pradhan, Fault-Tolerant Computer System Design , Prentice-Hall, 2003. Google ScholarGoogle Scholar
  102. V. K. Reddy, A. S. Al-Zawawi, and E. Rotenberg. "Assertion-Based Microarchitecture Design for Improved Fault Tolerance." in Proceedings of the 24th IEEE International Conference on Computer Design (ICCD-24) , pp. 362-369, October 2006.Google ScholarGoogle Scholar
  103. A. M. Saleh, J. J. Serrano, and J. H. Patel, "Reliability of Scrubbing Recovery Techniques for Memory Systems," IEEE Transactions on Reliability , Vol. 39, No. 1, pp. 114-122, April 1990.Google ScholarGoogle ScholarCross RefCross Ref
  104. C. E. Shannon, "A Mathematical Theory of Communication," Bell System Technical Journal , Vol. 27, pp. 379-423, 623-656, July-October, 1948.Google ScholarGoogle ScholarDigital LibraryDigital Library
  105. N. Wang, M. Fertig, and S. Patel, "Y-Branches: When You Come to a Fork in the Road, Take It," in 12th International Conference on Parallel Architectures and Compilation Techniques (PACT) , pp. 56-66, 2003. Google ScholarGoogle Scholar
  106. C. Webb, "z6--The Next-Generation Mainframe Microprocessor," Hot Chips , August 19-21, 2007.Google ScholarGoogle Scholar
  107. C. Weaver, J. Emer, S. S. Mukherjee, and S. K. Reinhardt, "Reducing the Soft Error Rate of a Microprocessor," IEEE Micro , Vol. 24, No. 6, pp. 30-37, November-December 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  108. T. M. Austin, "DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design," in 32nd Annual International Symposium on Microarchitecture (MICRO) , pp. 196-207, 1999. Google ScholarGoogle Scholar
  109. D. Bernick, B. Bruckert, P. D. Vigna, D. Garcia, R. Jardine, J. Klecka, and J. Smullen, "NonStop® AdvancedArchitecture," in Proceedings. International Conference on Dependable Systems and Networks (DSN) , pp. 12-21, Yakohama, Japan, June/July 2005. Google ScholarGoogle Scholar
  110. T. D. Bissett, P. A. Leveille, E. Muench, G. A. Tremblay, "Loosely-Coupled, Synchronized Execution," United States Patent 5,896,523, issued April 20, 1999.Google ScholarGoogle Scholar
  111. M. A. Gomaa, C. Scarbrough, T. N. Vijaykumar, and I. Pomeranz, "Transient Fault-Recovery for Chip Multiprocessors," in Proceedings of 30th Annual International Symposium on Computer Architecture (ISCA) , pp. 98-109, June 2003. Google ScholarGoogle Scholar
  112. M. A. Gomaa and T. N. Vijaykumar, "Opportunistic Fault Detection," in 32nd Annual International Symposium on Computer Architecture (ISCA) , pp. 172-183, Madison, Wisconsin, USA, June 2005. Google ScholarGoogle Scholar
  113. S. S. Mukherjee, M. Kontz, and S. K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives," in Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA) , pp. 99-110, Anchorage, Alaska, USA, May 2002. Google ScholarGoogle Scholar
  114. R. Nair and J. E. Smith, "Method and Apparatus for Fault-Tolerance Via Dual Thread Crosschecking," United States Patent Application, publication date September 19, 2002.Google ScholarGoogle Scholar
  115. A. Parashar, S. Gurumurthi, and A. Sivasubramaniam, "A Complexity-Effective Approach to ALU Bandwidth Enhancement for Instruction-Level Temporal Redundancy," in 31st Annual International Symposium on Computer Architecture (ISCA) , pp. 376-386, June 2004. Google ScholarGoogle Scholar
  116. A. Parashar, S. Gurumurthi, and A. Sivasubramaniam, "SlicK: Slice-Based Locality Exploitation for Efficient Redundant Multithreading," in 12th Annual International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) , pp. 95-105, October 2006. Google ScholarGoogle Scholar
  117. S. K. Reinhardt and S. S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading," in 27th Annual International Symposium on Computer Architecture (ISCA) , pp. 25-36, Vancouver, British Columbia, Canada, USA, June 2000. Google ScholarGoogle Scholar
  118. E. Rotenberg, "AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors," in 29th Annual Fault-Tolerant Computing Systems (FTCS) , p. 84, Madison, Wisconsin, USA, June 1999. Google ScholarGoogle Scholar
  119. D. P. Sieiorek and R. S. Swarz, Reliable Computer Systems: Design and Evaluation , A. K. Peters, 1998. Google ScholarGoogle Scholar
  120. T. J. Slegel, R. M. Averill III, M. A. Check, B. C. Giamei, B. W. Krumm, C. A. Krygowski, W. H. Li, J. S. Liptay, J. D. MacDougall, T. J. McPherson, J. A. Navarro, E. M. Schwarz, K. Shum, and C. F. Webb, "IBM's S/390 G5 Microprocessor Design," IEEE Micro , pp. 12-23, March/April 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  121. T. J. Slegel, E. Pfeffer, and J. A. Magee, "The IBM eServer z990 Microprocessor," IBM Journal of Research and Development , Vol. 48 No. 3/4, pp. 295-309, May/July 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  122. J. E. Smith and A. R. Pleszkun, "Implementing Precise Interrupts in Pipelined Processors," IEEE Transactions on Computers , Vol. 37, No. 5, pp. 562-573, May 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  123. A. Sodani and G. S. Sohi, "Dynamic Instruction Reuse," in 24th Annual International Symposium on Computer Architecture (ISCA) , pp. 194-205, Denver, Colorado, USA, June 1997. Google ScholarGoogle Scholar
  124. D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm, "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor," in 23rd Annual International Symposium on Computer Architecture (ISCA) , pp. 191-202, May 1999. Google ScholarGoogle Scholar
  125. D. M. Tullsen, S. J. Eggers, and H. M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," in 22nd Annual International Symposium on Computer Architecture (ISCA) , pp. 392-403, Italy, June 1995. Google ScholarGoogle Scholar
  126. J. Somers, "Stratus ftServer--Intel Fault Tolerant Platform," Intel Developer Forum, Fall 2002.Google ScholarGoogle Scholar
  127. L. Spainhower and T. A. Gregg, "IBM S/390 Parallel Enterprise Server G5 Fault Tolerance: A Historical Perspective," IBM Journal of Research and Development , Vol. 43, No. 5/6, pp. 863-873, September/November 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  128. T. N. Vijaykumar, I. Pomeranz, and K. Cheng, "Transient Fault Recovery using Simultaneous Multithreading," in Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA) , May 2002. Google ScholarGoogle Scholar
  129. C. Webb, "z6--The Next-Generation Mainframe Microprocessor," Hot Chips , August 2007.Google ScholarGoogle Scholar
  130. A. Wood, R. Jardine, and W. Bartlett, "Data Integrity in HP NonStop Servers," in 2nd IEEE Workshop on Silicon Errors in Logic and System Effects (SELSE) , Urbana-Champaign, April 2006.Google ScholarGoogle Scholar
  131. H. Ando, Y. Yoshida, A. Inoue, I. Sugiyama, T. Asakawa, K. Morita, T. Muta, T. Motokurumada, S. Okada, H. Yamashita, Y. Satsukawa, A. Konmoto, R. Yamashita, and H. Sugiyama, "A 1.3 GHz Fifth Generation SPARC Microprocessor," in 2003 IEEE Solid State Circuits Conference (ISSCC), pp. 1896-1905, 2003.
  132. J. Bartlett, W. Bartlett, R. Carr, D. Garcia, J. Gray, R. Horst, R. Jardine, D. Lenoski, and D. McGuire, "Fault Tolerance in Tandem Computer Systems," Technical Report 90.5, Part Number 40666, Hewlett-Packard, May 1990.
  133. W. Bartlett and L. Spainhower, "Commercial Fault Tolerance: A Tale of Two Systems," IEEE Transactions on Dependable and Secure Computing, Vol. 1, No. 1, pp. 87-96, January-March 2004.
  134. D. Bernick, B. Bruckert, P. D. Vigna, D. Garcia, R. Jardine, J. Klecka, and J. Smullen, "NonStop® Advanced Architecture," in Proceedings of the International Conference on Dependable Systems and Networks (DSN), pp. 12-21, 2005.
  135. B. Bloom, "Space/Time Trade-offs in Hash Coding with Allowable Errors," Communications of the ACM, Vol. 13, No. 7, pp. 422-426, July 1970.
  136. D. Burger and T. M. Austin, "The SimpleScalar Tool Set, Version 2.0," Technical Report 1342, Computer Sciences Department, University of Wisconsin-Madison, June 1997.
  137. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson, "A Survey of Rollback-Recovery Protocols in Message-Passing Systems," Technical Report CMU-CS-99-148, School of Computer Science, Carnegie Mellon University, June 1999.
  138. M. A. Gomaa and T. N. Vijaykumar, "Opportunistic Fault Detection," in 32nd Annual International Symposium on Computer Architecture, pp. 172-183, 2005.
  139. M. A. Gomaa, C. Scarbrough, T. N. Vijaykumar, and I. Pomeranz, "Transient Fault-Recovery for Chip Multiprocessors," in 30th Annual International Symposium on Computer Architecture, pp. 96-109, June 2003.
  140. P. A. Green Jr., "Observations From 16 Years at a Fault-Tolerant Computer Company," in 15th Symposium on Reliable Distributed Systems, pp. 162-164, 1996.
  141. S. Hangal and M. Lam, "Tracking Down Software Bugs Using Automatic Anomaly Detection," in International Conference on Software Engineering, ICSE'02, pp. 291-301, May 2002.
  142. S. S. Mukherjee, S. K. Reinhardt, and J. S. Emer, "Incremental Checkpointing in a Multi-Threaded Architecture," United States Patent Application, Filed August 29, 2003.
  143. J. Nakano, P. Montesinos, K. Gharachorloo, and J. Torrellas, "ReViveI/O: Efficient Handling of I/O in Highly-Available Rollback-Recovery Servers," in 12th Annual International Symposium on High-Performance Computer Architecture (HPCA), pp. 200-211, 2006.
  144. E. Perelman, G. Hamerly, M. Van Biesbrouck, T. Sherwood, and B. Calder, "Using SimPoint for Accurate and Efficient Simulation," in ACM SIGMETRICS, the International Conference on Measurement and Modeling of Computer Systems, pp. 318-319, June 2003.
  145. M. Prvulovic, Z. Zhang, and J. Torrellas, "ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors," in 29th Annual International Symposium on Computer Architecture (ISCA), pp. 111-122, 2002.
  146. P. Racunas, K. Constantinides, S. Manne, and S. S. Mukherjee, "Perturbation-Based Fault Screening," in 13th Annual International Symposium on High-Performance Computer Architecture (HPCA), pp. 169-180, February 2007.
  147. S. K. Reinhardt, S. S. Mukherjee, and J. S. Emer, "Periodic Checkpointing in a Redundantly Multi-Threaded Architecture," United States Patent Application, Filed August 29, 2003.
  148. J. E. Smith and A. R. Pleszkun, "Implementation of Precise Interrupts in Pipelined Processors," in 12th International Symposium on Computer Architecture, pp. 291-299, 1985.
  149. J. C. Smolens, B. T. Gold, J. Kim, B. Falsafi, J. C. Hoe, and A. G. Nowatzyk, "Fingerprinting: Bounding Soft-Error Detection Latency and Bandwidth," IEEE Micro, Vol. 24, No. 6, pp. 22-29, November 2004.
  150. D. J. Sorin, M. M. K. Martin, M. D. Hill, and D. A. Wood, "SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery," in International Symposium on Computer Architecture (ISCA), pp. 123-134, May 2002.
  151. T. N. Vijaykumar, I. Pomeranz, and K. Cheng, "Transient Fault Recovery Using Simultaneous Multithreading," in 29th Annual International Symposium on Computer Architecture, pp. 87-98, May 2002.
  152. N. J. Wang and S. J. Patel, "ReStore: Symptom-Based Soft Error Detection in Microprocessors," IEEE Transactions on Dependable and Secure Computing, Vol. 3, No. 3, pp. 188-201, July-September 2006.
  153. C. Weaver, J. Emer, S. S. Mukherjee, and S. K. Reinhardt, "Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor," in 31st Annual International Symposium on Computer Architecture (ISCA), pp. 264-275, 2004.
  154. T. C. Bressoud and F. B. Schneider, "Hypervisor-Based Fault Tolerance," ACM Transactions on Computer Systems, Vol. 14, No. 1, pp. 80-107, February 1996.
  155. G. Bronevetsky, D. Marques, K. Pingali, P. Szwed, and M. Schulz, "Application-Level Checkpointing for Shared Memory Programs," in 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 235-247, October 2004.
  156. J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann Publishers, 1993.
  157. C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, "Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation," in ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 190-200, June 2005.
  158. A. Mahmood and E. J. McCluskey, "Concurrent Error Detection Using Watchdog Processors--A Survey," IEEE Transactions on Computers, Vol. 37, No. 2, pp. 160-174, February 1988.
  159. Y. Masubuchi, S. Hoshina, T. Shimada, H. Hirayama, and N. Kato, "Fault Recovery Mechanism for Multiprocessor Servers," in 27th International Symposium on Fault-Tolerant Computing, pp. 184-193, 1997.
  160. J. Nakano, P. Montesinos, K. Gharachorloo, and J. Torrellas, "ReViveI/O: Efficient Handling of I/O in Highly-Available Rollback-Recovery Servers," in 12th International Symposium on High-Performance Computer Architecture (HPCA), pp. 200-211, 2006.
  161. N. Nakka, Z. Kalbarczyk, R. K. Iyer, and J. Xu, "An Architectural Framework for Providing Reliability and Security Support," in International Conference on Dependable Systems and Networks (DSN), pp. 585-594, 2004.
  162. N. Oh, P. P. Shirvani, and E. J. McCluskey, "Error Detection by Duplicated Instructions in Super-Scalar Processors," IEEE Transactions on Reliability, Vol. 51, No. 1, pp. 63-75, March 2002.
  163. G. A. Reis, J. Chang, and D. I. August, "Automatic Instruction-Level Software-Only Recovery," IEEE Micro, Vol. 27, No. 1, pp. 36-47, January 2007.
  164. G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August, "SWIFT: Software Implemented Fault Tolerance," in 3rd International Symposium on Code Generation and Optimization (CGO), pp. 243-254, March 2005.
  165. G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, D. I. August, and S. S. Mukherjee, "Design and Evaluation of Hybrid Fault-Detection Systems," in 32nd International Symposium on Computer Architecture (ISCA), pp. 148-159, June 2005.
  166. G. A. Reis, J. Chang, D. I. August, R. Cohn, and S. S. Mukherjee, "Configurable Transient Fault Detection via Dynamic Binary Translation," in 2nd Workshop on Architectural Reliability (WAR), December 2006.
  167. M. A. Schuette and J. P. Shen, "Processor Control Flow Monitoring Using Signatured Instruction Streams," IEEE Transactions on Computers, Vol. C-36, No. 3, pp. 264-276, March 1987.
  168. G. Tremblay, P. Leveille, J. McCollum, M. J. Pratt, and T. Bissett, "Fault Resilient/Fault Tolerant Computing," European Patent Application Number 04254117.7, filed July 9th, 2004.
  169. K. R. Walcott, G. Humphreys, and S. Gurumurthi, "Dynamic Prediction of Architectural Vulnerability from Microarchitectural State," in International Symposium on Computer Architecture (ISCA), pp. 516-527, San Diego, California, June 2007.

Cited By

  1. Venkatesha S and Parthasarathi R (2024). Survey on Redundancy Based-Fault tolerance methods for Processors and Hardware accelerators - Trends in Quantum Computing, Heterogeneous Systems and Reliability, ACM Computing Surveys, 56:11, (1-76), Online publication date: 30-Nov-2024.
  2. Netti A, Peng Y, Omland P, Paulitsch M, Parra J, Espinosa G, Agarwal U, Chan A and Pattabiraman K (2023). Mixed precision support in HPC applications, Journal of Parallel and Distributed Computing, 181:C, Online publication date: 1-Nov-2023.
  3. Jia J, Liu Y, Zhang G, Gao Y and Qian D (2022). Software approaches for resilience of high performance computing systems: a survey, Frontiers of Computer Science: Selected Publications from Chinese Universities, 17:4, Online publication date: 1-Aug-2023.
  4. Topçu B and Öz I (2023). Soft error vulnerability prediction of GPGPU applications, The Journal of Supercomputing, 79:6, (6965-6990), Online publication date: 1-Apr-2023.
  5. Manzhosov E, Hastings A, Pancholi M, Piersma R, Ziad M and Sethumadhavan S Revisiting Residue Codes for Modern Memories Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture, (73-90)
  6. Al-haj Ahmad H and Sedaghat Y (2022). CAFI, Microprocessors & Microsystems, 94:C, Online publication date: 1-Oct-2022.
  7. Fischer M, Riedel O and Lechler A Comprehensive Analysis of Software-Based Fault Tolerance with Arithmetic Coding for Performant Encoding of Integer Calculations Computer Safety, Reliability, and Security, (144-157)
  8. Öz I and Karadaş Ö (2022). Regional soft error vulnerability and error propagation analysis for GPGPU applications, The Journal of Supercomputing, 78:3, (4095-4130), Online publication date: 1-Feb-2022.
  9. Arslan S and Unsal O (2021). Efficient selective replication of critical code regions for SDC mitigation leveraging redundant multithreading, The Journal of Supercomputing, 77:12, (14130-14160), Online publication date: 1-Dec-2021.
  10. Papadimitriou G and Gizopoulos D Demystifying the system vulnerability stack Proceedings of the 48th Annual International Symposium on Computer Architecture, (902-915)
  11. Öz I and Arslan S (2021). Predicting the Soft Error Vulnerability of Parallel Applications Using Machine Learning, International Journal of Parallel Programming, 49:3, (410-439), Online publication date: 1-Jun-2021.
  12. Oz I and Arslan S (2019). A Survey on Multithreading Alternatives for Soft Error Fault Tolerance, ACM Computing Surveys, 52:2, (1-38), Online publication date: 31-Mar-2020.
  13. Sotiropolos P and Vassilakis C (2022). Detection of intermittent faults in software programs through identification of suspicious shared variable access patterns, Journal of Systems and Software, 159:C, Online publication date: 1-Jan-2020.
  14. Chen J, Li H, Li S, Liang X, Wu P, Tao D, Ouyang K, Liu Y, Zhao K, Guan Q and Chen Z Fault tolerant one-sided matrix decompositions on heterogeneous systems with GPUs Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, (1-12)
  16. Ozer E, Venu B, Iturbe X, Das S, Lyberis S, Biggs J, Harrod P and Penton J Error correlation prediction in lockstep processors for safety-critical systems Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, (737-748)
  17. Yan Z, Jiang H, Srisa-an W, Seth S and Tan Y Leverage Redundancy in Hardware Transactional Memory to Improve Cache Reliability Proceedings of the 47th International Conference on Parallel Processing, (1-10)
  18. Rosa F, Bandeira V, Reis R and Ost L Extensive evaluation of programming models and ISAs impact on multicore soft error reliability Proceedings of the 55th Annual Design Automation Conference, (1-6)
  19. da Rosa F, Bandeira V, Reis R and Ost L Extensive Evaluation of Programming Models and ISAs Impact on Multicore Soft Error Reliability 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), (1-6)
  20. Chennakesavulu M, Jayachandra Prasad T and Sumalatha V (2018). Improved Performance of Error Controlling Codes Using Pass Transistor Logic, Circuits, Systems, and Signal Processing, 37:3, (1145-1161), Online publication date: 1-Mar-2018.
  21. Li T, Ambrose J, Ragel R and Parameswaran S (2016). Processor Design for Soft Errors, ACM Computing Surveys, 49:3, (1-44), Online publication date: 30-Sep-2017.
  22. Cho H, Cheng E, Shepherd T, Cher C and Mitra S (2017). System-Level Effects of Soft Errors in Uncore Components, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 36:9, (1497-1510), Online publication date: 1-Sep-2017.
  23. Moradian H, Lee J and Yu J (2017). Efficient Low-Cost Fault-Localization and Self-Repairing Radix-2 Signed-Digit Adders Applying the Self-Dual Concept, Journal of Signal Processing Systems, 88:3, (297-309), Online publication date: 1-Sep-2017.
  24. Didehban M and Shrivastava A nZDC Proceedings of the 53rd Annual Design Automation Conference, (1-6)
  25. Ebrahimi M, Moshrefpour M, Golanbari M and Tahoori M Fault injection acceleration by simultaneous injection of non-interacting faults Proceedings of the 53rd Annual Design Automation Conference, (1-6)
  26. Wu P, Guan Q, DeBardeleben N, Blanchard S, Tao D, Liang X, Chen J and Chen Z Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, (31-42)
  27. Riera M, Canal R, Abella J and Gonzalez A A detailed methodology to compute soft error rates in advanced technologies Proceedings of the 2016 Conference on Design, Automation & Test in Europe, (217-222)
  28. Chen L, Ebrahimi M and Tahoori M (2016). Reliability-Aware Resource Allocation and Binding in High-Level Synthesis, ACM Transactions on Design Automation of Electronic Systems, 21:2, (1-27), Online publication date: 28-Jan-2016.
  29. Jing N, Zhou J, Jiang J, Chen X, He W and Mao Z Redundancy based Interconnect Duplication to Mitigate Soft Errors in SRAM-based FPGAs Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, (764-769)
  30. Bustamante L and Al-Asaad H Detection of soft errors through checksums in redundant execution systems 2015 IEEE AUTOTESTCON, (134-137)
  31. Cho H, Cher C, Shepherd T and Mitra S Understanding soft errors in uncore components Proceedings of the 52nd Annual Design Automation Conference, (1-6)
  32. Yetim Y, Malik S and Martonosi M (2015). CommGuard, ACM SIGARCH Computer Architecture News, 43:1, (311-323), Online publication date: 29-May-2015.
  33. Yetim Y, Malik S and Martonosi M (2015). CommGuard, ACM SIGPLAN Notices, 50:4, (311-323), Online publication date: 12-May-2015.
  34. Rodopoulos D, Psychou G, Sabry M, Catthoor F, Papanikolaou A, Soudris D, Noll T and Atienza D (2015). Classification Framework for Analysis and Modeling of Physically Induced Reliability Violations, ACM Computing Surveys, 47:3, (1-33), Online publication date: 16-Apr-2015.
  35. Yetim Y, Malik S and Martonosi M CommGuard Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, (311-323)
  36. Rodopoulos D, Papanikolaou A, Catthoor F and Soudris D (2015). Demonstrating HW–SW Transient Error Mitigation on the Single-Chip Cloud Computer Data Plane, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 23:3, (507-519), Online publication date: 1-Mar-2015.
  37. Wadden J, Lyashevsky A, Gurumurthi S, Sridharan V and Skadron K (2014). Real-world design and evaluation of compiler-managed GPU redundant multithreading, ACM SIGARCH Computer Architecture News, 42:3, (73-84), Online publication date: 16-Oct-2014.
  38. Upasani G, Vera X and González A (2014). Avoiding core's DUE & SDC via acoustic wave detectors and tailored error containment and recovery, ACM SIGARCH Computer Architecture News, 42:3, (37-48), Online publication date: 16-Oct-2014.
  39. Döbel B and Härtig H Can we put concurrency back into redundant multithreading? Proceedings of the 14th International Conference on Embedded Software, (1-10)
  40. Schirmeier H, Borchert C and Spinczyk O Rapid Fault-Space Exploration by Evolutionary Pruning Proceedings of the 33rd International Conference on Computer Safety, Reliability, and Security - Volume 8666, (17-32)
  41. Wadden J, Lyashevsky A, Gurumurthi S, Sridharan V and Skadron K Real-world design and evaluation of compiler-managed GPU redundant multithreading Proceeding of the 41st annual international symposium on Computer architecture, (73-84)
  42. Upasani G, Vera X and González A Avoiding core's DUE & SDC via acoustic wave detectors and tailored error containment and recovery Proceeding of the 41st annual international symposium on Computer architecture, (37-48)
  43. Shrivastava A, Rhisheekesan A, Jeyapaul R and Wu C Quantitative Analysis of Control Flow Checking Mechanisms for Soft Errors Proceedings of the 51st Annual Design Automation Conference, (1-6)
  44. Zhang H, Kochte M, Imhof M, Bauer L, Wunderlich H and Henkel J GUARD Proceedings of the 51st Annual Design Automation Conference, (1-6)
  45. Liu B and Wang B Embedded reconfigurable logic for ASIC design obfuscation against supply chain attacks Proceedings of the conference on Design, Automation & Test in Europe, (1-6)
  46. Caplan J, Mera M, Milder P and Meyer B Trade-offs in execution signature compression for reliable processor systems Proceedings of the conference on Design, Automation & Test in Europe, (1-6)
  47. Amin M, Shakir M, Javed A, Hassan M and Raza S (2014). Low-Cost fault tolerant methodology for real time MPSoC based embedded system, International Journal of Reconfigurable Computing, 2014, (13-13), Online publication date: 1-Jan-2014.
  48. Sun G, Kursun E, Rivers J and Xie Y (2013). Exploring the vulnerability of CMPs to soft errors with 3D stacked nonvolatile memory, ACM Journal on Emerging Technologies in Computing Systems, 9:3, (1-22), Online publication date: 1-Sep-2013.
  49. Khudia D and Mahlke S Low cost control flow protection using abstract control signatures Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems, (3-12)
  51. Khudia D and Mahlke S (2013). Low cost control flow protection using abstract control signatures, ACM SIGPLAN Notices, 48:5, (3-12), Online publication date: 23-May-2013.
  52. Lee J, Ko Y, Lee K, Youn J and Paek Y (2013). Dynamic code duplication with vulnerability awareness for soft error detection on VLIW architectures, ACM Transactions on Architecture and Code Optimization, 9:4, (1-24), Online publication date: 1-Jan-2013.
  53. Costello F Noisy reasoners Proceedings of the 5th international conference on Artificial General Intelligence, (31-40)
  54. Döbel B, Härtig H and Engel M Operating system support for redundant multithreading Proceedings of the tenth ACM international conference on Embedded software, (83-92)
  55. Upasani G, Vera X and González A (2012). Setting an error detection infrastructure with low cost acoustic wave detectors, ACM SIGARCH Computer Architecture News, 40:3, (333-343), Online publication date: 5-Sep-2012.
  56. Shayan M, Singh V, Singh A and Fujita M SEU tolerant robust latch design Proceedings of the 16th international conference on Progress in VLSI Design and Test, (223-232)
  57. Sardashti S and Wood D UniFI Proceedings of the 26th ACM international conference on Supercomputing, (59-68)
  58. Upasani G, Vera X and González A Setting an error detection infrastructure with low cost acoustic wave detectors Proceedings of the 39th Annual International Symposium on Computer Architecture, (333-343)
  59. Pan Z and Breuer M (2012). Error Rate Estimation for Defective Circuits via Ones Counting, ACM Transactions on Design Automation of Electronic Systems, 17:1, (1-14), Online publication date: 1-Jan-2012.
  60. Meyer B, Calhoun B, Lach J and Skadron K Cost-effective safety and fault localization using distributed temporal redundancy Proceedings of the 14th international conference on Compilers, architectures and synthesis for embedded systems, (125-134)
  61. Agarwal R, Garg P and Torrellas J (2011). Rebound, ACM SIGARCH Computer Architecture News, 39:3, (153-164), Online publication date: 22-Jun-2011.
  62. Agarwal R, Garg P and Torrellas J Rebound Proceedings of the 38th annual international symposium on Computer architecture, (153-164)
  63. Jose M, Hu Y and Majumdar R On power and fault-tolerance optimization in FPGA physical synthesis Proceedings of the International Conference on Computer-Aided Design, (224-229)
  64. Lee J, Feng Z and He L In-place decomposition for robustness in FPGA Proceedings of the International Conference on Computer-Aided Design, (143-148)
  65. Calimera A, Loghi M, Macii E and Poncino M Dynamic indexing Proceedings of the 16th ACM/IEEE international symposium on Low power electronics and design, (343-348)
  66. Thompto B and Hoppe B Verification for fault tolerance of the IBM system z microprocessor Proceedings of the 47th Design Automation Conference, (525-530)
  67. Jose M, Hu Y, Majumdar R and He L Rewiring for robustness Proceedings of the 47th Design Automation Conference, (469-474)
  68. Izydorczyk J (2010). Three steps to the thermal noise death of Moore's law, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 18:1, (161-165), Online publication date: 1-Jan-2010.
  69. Sánchez D, Aragón J and García J REPAS Proceedings of the 15th International Euro-Par Conference on Parallel Processing, (321-333)
  70. Väyrynen M, Singh V and Larsson E Fault-tolerant average execution time optimization for general-purpose multi-processor system-on-chips Proceedings of the Conference on Design, Automation and Test in Europe, (484-489)
  71. Florio V and Blondia C (2008). On the requirements of new software development, International Journal of Business Intelligence and Data Mining, 3:3, (330-349), Online publication date: 1-Dec-2008.
  72. Jing N, Zhou J, Jiang J, Chen X, He W and Mao Z Redundancy based interconnect duplication to mitigate soft errors in SRAM-based FPGAs 2015 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), (764-769)
Contributors
  • Intel Corporation

Recommendations