Transient Fault Tolerance in Digital Systems

Author:

Janusz Sosnowski

IEEE Micro, Volume 14, Issue 1

Pages 24 - 35

https://doi.org/10.1109/40.259897

Published: 01 February 1994

Abstract

It is hard to shield systems effectively from transient faults (fault avoidance techniques), so other means must be employed to assure an appropriate level of transient fault tolerance (insensitivity to transient faults). These means are based on fault-masking and fault-recovery ideas. Having analyzed this problem, the author identifies critical design points and outlines some practical solutions involving efficient on-line detectors (which detect errors during system operation) and error-handling procedures. This framework provides a basis for understanding transient fault problems in digital systems and can help in selecting optimum techniques to mask or eliminate transient fault effects in developed systems.

Index Terms

Transient Fault Tolerance in Digital Systems
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
2. Networks
  1. Network performance evaluation

Index terms have been assigned to the content through auto-classification.

Published In

IEEE Micro Volume 14, Issue 1

February 1994

119 pages

ISSN:0272-1732

Copyright © 1994 IEEE. All Rights Reserved.

Publisher

IEEE Computer Society Press

Washington, DC, United States

Qualifiers

Research-article

Cited By

Arasteh B, Miremadi S, Rahmani A (2014). Developing Inherently Resilient Software Against Soft-Errors Based on Algorithm Level Inherent Features. Journal of Electronic Testing: Theory and Applications, 30:2 (193-212). DOI: 10.1007/s10836-014-5438-8. Online publication date: 1-Apr-2014.
https://dl.acm.org/doi/10.1007/s10836-014-5438-8
Kishani M, Baniasadi A, Pedram H (2011). Using silent writes in low-power traffic-aware ECC. Proceedings of the 21st international conference on Integrated circuit and system design: power and timing modeling, optimization, and simulation (180-192). DOI: 10.5555/2045364.2045383. Online publication date: 26-Sep-2011.
https://dl.acm.org/doi/10.5555/2045364.2045383
Huang J, Blech J, Raabe A, Buckl C, Knoll A, Dick R, Madsen J (2011). Analysis and optimization of fault-tolerant task scheduling on multiprocessor embedded systems. Proceedings of the seventh IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis (247-256). DOI: 10.1145/2039370.2039409. Online publication date: 9-Oct-2011.
https://dl.acm.org/doi/10.1145/2039370.2039409
Zhang W (2010). Replica victim caching to improve cache reliability against transient errors. International Journal of High Performance Systems Architecture, 2:3/4 (229-239). DOI: 10.1504/IJHPSA.2010.034543. Online publication date: 1-Aug-2010.
https://dl.acm.org/doi/10.1504/IJHPSA.2010.034543
Pop P, Izosimov V, Eles P, Peng Z (2009). Design optimization of time-and cost-constrained fault-tolerant embedded systems with checkpointing and replication. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 17:3 (389-402). DOI: 10.1109/TVLSI.2008.2003166. Online publication date: 1-Mar-2009.
https://dl.acm.org/doi/10.1109/TVLSI.2008.2003166
Saha G (2006). Software based fault tolerance. Ubiquity, 2006:July (1-1). DOI: 10.1145/1149633.1147995. Online publication date: 1-Jul-2006.
https://dl.acm.org/doi/10.1145/1149633.1147995
Heidergott W, Joyner W, Martin G, Kahng A (2005). SEU tolerant device, circuit and processor design. Proceedings of the 42nd annual Design Automation Conference (5-10). DOI: 10.1145/1065579.1065586. Online publication date: 13-Jun-2005.
https://dl.acm.org/doi/10.1145/1065579.1065586
Kandasamy N, Hayes J, Murray B (2005). Time-Constrained Failure Diagnosis in Distributed Embedded Systems. IEEE Transactions on Parallel and Distributed Systems, 16:3 (258-270). DOI: 10.1109/TPDS.2005.37. Online publication date: 1-Mar-2005.
https://dl.acm.org/doi/10.1109/TPDS.2005.37
Zhang W (2005). Replication Cache. IEEE Transactions on Computers, 54:12 (1547-1555). DOI: 10.1109/TC.2005.202. Online publication date: 1-Dec-2005.
https://dl.acm.org/doi/10.1109/TC.2005.202
Ryu S, Park D (2005). Checkpointing for the reliability of real-time systems with on-line fault detection. Proceedings of the 2005 international conference on Embedded and Ubiquitous Computing (194-202). DOI: 10.1007/11596356_22. Online publication date: 6-Dec-2005.
https://dl.acm.org/doi/10.1007/11596356_22
