research-article

Transient Fault Tolerance in Digital Systems

Published: 01 February 1994

Abstract

Shielding systems effectively from transient faults (fault avoidance) is difficult, so other means must be used to ensure an appropriate level of transient fault tolerance, that is, insensitivity to transient faults. These means are based on fault masking and fault recovery. Having analyzed the problem, the author identifies critical design points and outlines practical solutions built on efficient on-line detectors, which detect errors during system operation, and on error-handling procedures. This framework provides a basis for understanding transient fault problems in digital systems. It can also help in selecting the best techniques for masking or eliminating transient fault effects in developed systems.
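To make the two ideas named in the abstract concrete, the following is a minimal illustrative sketch, not the author's method. It assumes fault masking is realized as a majority vote over replicated results and fault recovery as re-execution triggered by an on-line detector; the detector here is a mod-3 residue check on an addition. All names (`majority_vote`, `run_with_retry`, `make_flaky_adder`, `DetectedError`) are hypothetical and chosen only for this example.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class DetectedError(Exception):
    """Raised when an error is detected and cannot be masked or recovered."""


def majority_vote(replicas: Sequence[T]) -> T:
    """Fault masking: return the value agreed on by a strict majority of replicas."""
    value, count = Counter(replicas).most_common(1)[0]
    if count * 2 <= len(replicas):
        raise DetectedError("no majority among replicas")
    return value


def run_with_retry(operation: Callable[[], T],
                   check: Callable[[T], bool],
                   max_retries: int = 3) -> T:
    """Fault recovery: re-execute the operation while the detector rejects its result."""
    for _ in range(max_retries + 1):
        result = operation()
        if check(result):
            return result
    raise DetectedError("error persisted after retries; the fault may be permanent")


def make_flaky_adder(a: int, b: int, faulty_calls: set) -> Callable[[], int]:
    """Hypothetical adder whose result has one bit flipped on the listed call numbers."""
    calls = {"n": 0}

    def add() -> int:
        calls["n"] += 1
        result = a + b
        if calls["n"] in faulty_calls:
            result ^= 1 << 3  # flip bit 3 to mimic a transient upset
        return result

    return add


a, b = 2, 3


def residue_check(r: int) -> bool:
    """On-line detector: the sum's residue mod 3 must match that of the operands."""
    return r % 3 == (a % 3 + b % 3) % 3


masked = majority_vote([5, 5, 13])  # one corrupted replica is outvoted
recovered = run_with_retry(make_flaky_adder(a, b, {1}), residue_check)
```

In this sketch the first call of the flaky adder is corrupted, the residue check rejects it, and the second execution succeeds. A fault that persists across every retry is reported rather than silently accepted, which is one simple way to separate transient faults from permanent ones.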

Published In

IEEE Micro, Volume 14, Issue 1
February 1994
119 pages

Publisher

IEEE Computer Society Press, Washington, DC, United States

Cited By

  • (2014) Developing Inherently Resilient Software Against Soft-Errors Based on Algorithm Level Inherent Features. Journal of Electronic Testing: Theory and Applications 30:2 (193-212). DOI: 10.1007/s10836-014-5438-8. Online publication date: 1-Apr-2014
  • (2011) Using silent writes in low-power traffic-aware ECC. Proceedings of the 21st international conference on Integrated circuit and system design: power and timing modeling, optimization, and simulation (180-192). DOI: 10.5555/2045364.2045383. Online publication date: 26-Sep-2011
  • (2011) Analysis and optimization of fault-tolerant task scheduling on multiprocessor embedded systems. Proceedings of the seventh IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis (247-256). DOI: 10.1145/2039370.2039409. Online publication date: 9-Oct-2011
  • (2010) Replica victim caching to improve cache reliability against transient errors. International Journal of High Performance Systems Architecture 2:3/4 (229-239). DOI: 10.1504/IJHPSA.2010.034543. Online publication date: 1-Aug-2010
  • (2009) Design optimization of time-and cost-constrained fault-tolerant embedded systems with checkpointing and replication. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 17:3 (389-402). DOI: 10.1109/TVLSI.2008.2003166. Online publication date: 1-Mar-2009
  • (2006) Software based fault tolerance. Ubiquity 2006:July (1-1). DOI: 10.1145/1149633.1147995. Online publication date: 1-Jul-2006
  • (2005) SEU tolerant device, circuit and processor design. Proceedings of the 42nd annual Design Automation Conference (5-10). DOI: 10.1145/1065579.1065586. Online publication date: 13-Jun-2005
  • (2005) Time-Constrained Failure Diagnosis in Distributed Embedded Systems. IEEE Transactions on Parallel and Distributed Systems 16:3 (258-270). DOI: 10.1109/TPDS.2005.37. Online publication date: 1-Mar-2005
  • (2005) Replication Cache. IEEE Transactions on Computers 54:12 (1547-1555). DOI: 10.1109/TC.2005.202. Online publication date: 1-Dec-2005
  • (2005) Checkpointing for the reliability of real-time systems with on-line fault detection. Proceedings of the 2005 international conference on Embedded and Ubiquitous Computing (194-202). DOI: 10.1007/11596356_22. Online publication date: 6-Dec-2005
