Abstract
As MOS device sizes continue to shrink, ever smaller charges, such as those carried by single ionizing particles of naturally occurring radiation, are sufficient to upset the operation of complex modern microprocessors. To handle these inevitable errors, designs should include fault-tolerant features so that processors continue to operate correctly despite the occurrence of errors. The main goal of this work is to develop architectural mechanisms that protect processors against the effects of such radiation-induced transient faults. First, from a program execution perspective, many faults manifest themselves as control flow errors that cause the processor to violate the correct sequencing of instructions. We present a basic compile-time signature assignment algorithm and describe a novel approach that improves its fault detection coverage. Moreover, to allow the processor to efficiently check the run-time instruction sequence and detect control flow errors, we introduce an on-chip assigned-signature checker capable of executing three additional instructions (SIC, SIJ, SIJC). Second, since simultaneous multi-threading (SMT) inherently provides the necessary redundancy, several proposals run two copies of the same thread on an SMT platform in order to detect and correct soft errors. Upon detection of an error, the processor state is rolled back to a known safe point and the affected instructions are retried, thereby achieving error-free execution. This paper focuses on two crucial implementation issues introduced by this scheme: (1) the design trade-off between fault detection coverage and design cost; and (2) the possible occurrence of deadlock situations.
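The general idea behind assigned-signature control flow checking can be illustrated with a minimal Python sketch. It is not the algorithm of this paper: the signature scheme, the function names (`assign_signatures`, `legal_transitions`, `first_control_flow_error`) and the example control flow graph are all assumptions made for illustration, and the sketch does not model the SIC, SIJ, or SIJC instructions, whose semantics are not given in the abstract. Each basic block receives a distinct signature at compile time, and a run-time check flags any transition between signatures that does not correspond to an edge of the control flow graph.

```python
from typing import Dict, List, Optional, Set, Tuple


def assign_signatures(blocks: List[str]) -> Dict[str, int]:
    """Give every basic block a distinct signature (hypothetical scheme: 1, 2, 3, ...)."""
    return {name: index + 1 for index, name in enumerate(blocks)}


def legal_transitions(
    cfg: Dict[str, List[str]], sig: Dict[str, int]
) -> Set[Tuple[int, int]]:
    """Collect the (source, destination) signature pairs allowed by the CFG edges."""
    return {(sig[src], sig[dst]) for src, dsts in cfg.items() for dst in dsts}


def first_control_flow_error(
    trace: List[str], sig: Dict[str, int], legal: Set[Tuple[int, int]]
) -> Optional[int]:
    """Return the trace index of the first illegal block transition, or None."""
    for i in range(1, len(trace)):
        if (sig[trace[i - 1]], sig[trace[i]]) not in legal:
            return i
    return None


# Illustrative control flow graph: a single loop with one exit (assumed example).
CFG = {"entry": ["loop"], "loop": ["body", "exit"], "body": ["loop"], "exit": []}
SIG = assign_signatures(list(CFG))
LEGAL = legal_transitions(CFG, SIG)

# A fault-free run follows CFG edges only.
good_trace = ["entry", "loop", "body", "loop", "exit"]
# A transient fault that diverts control from "body" straight to "exit".
bad_trace = ["entry", "loop", "body", "exit"]

print(first_control_flow_error(good_trace, SIG, LEGAL))
print(first_control_flow_error(bad_trace, SIG, LEGAL))
```

In this sketch the check runs in software over a recorded trace; a hardware checker would instead perform the comparison as the instructions execute. Errors that stay within a block, or that follow a legal but wrong edge, are not caught by this simple form of checking.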
Open Access
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Li, X., Gaudiot, JL. Tolerating Radiation-Induced Transient Faults in Modern Processors. Int J Parallel Prog 38, 85–116 (2010). https://doi.org/10.1007/s10766-009-0114-9
DOI: https://doi.org/10.1007/s10766-009-0114-9