survey

A Survey on Multithreading Alternatives for Soft Error Fault Tolerance

Published: 27 March 2019

Abstract

Smaller transistor sizes and lower voltage levels in modern microprocessors lead to higher soft error rates, which makes reliability a primary design constraint for computer systems. Redundant multithreading (RMT) exploits the parallelism of modern systems by employing thread-level time redundancy for fault detection and recovery. RMT detects faults by running identical copies of a program as separate threads on parallel execution units with identical inputs and comparing their outputs. In this article, we present a survey of RMT implementations at different architectural levels and the design considerations behind them. We explain the implementations in the seminal papers and their extensions, and discuss the design choices these techniques make. We review both hardware and software approaches, presenting their main characteristics and analyzing the strengths and weaknesses of the different design choices. We also present a classification to help potential users find a method suited to their requirements and to guide researchers planning to work in this area by providing insights into future trends.
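
The detection scheme described in the abstract can be sketched in software. The following is a minimal illustration under stated assumptions, not the implementation of any surveyed technique: `run_redundant`, `FaultDetected`, `inject_bit_flip`, and `checksum` are hypothetical names introduced here, and the single flipped bit merely stands in for a soft error. Python threads share one interpreter, so the sketch shows the replicate-and-compare structure rather than execution on separate hardware units.

```python
import threading
from typing import Any, Callable, List, Optional


class FaultDetected(Exception):
    """Raised when the redundant copies produce different outputs."""


def run_redundant(fn: Callable[..., Any], *args: Any, copies: int = 2) -> Any:
    """Run `fn` in `copies` threads with identical inputs and compare outputs."""
    results: List[Any] = [None] * copies
    errors: List[Optional[BaseException]] = [None] * copies

    def worker(slot: int) -> None:
        # Each copy writes only to its own slot, so no lock is needed here.
        try:
            results[slot] = fn(*args)
        except BaseException as exc:
            errors[slot] = exc

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(copies)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    for err in errors:
        if err is not None:
            raise err

    # Output comparison: any disagreement is reported as a detected fault.
    if any(r != results[0] for r in results[1:]):
        raise FaultDetected(f"redundant copies disagree: {results!r}")
    return results[0]


def inject_bit_flip(fn: Callable[..., int], faulty_call: int) -> Callable[..., int]:
    """Hypothetical test helper: flip the lowest result bit on one chosen call."""
    lock = threading.Lock()
    calls = 0

    def wrapped(*args: Any) -> int:
        nonlocal calls
        with lock:
            calls += 1
            call_number = calls
        result = fn(*args)
        return result ^ 1 if call_number == faulty_call else result

    return wrapped


def checksum(values: List[int]) -> int:
    """Stand-in workload for the program being protected."""
    return sum(values)


if __name__ == "__main__":
    print(run_redundant(checksum, [1, 2, 3]))
    try:
        run_redundant(inject_bit_flip(checksum, faulty_call=2), [1, 2, 3])
    except FaultDetected as exc:
        print("fault detected:", exc)
```

With two copies, a mismatch can only be detected, because nothing indicates which copy is correct. Recovery needs an additional step such as re-execution or a third copy with majority voting.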

      Published In

      ACM Computing Surveys  Volume 52, Issue 2
      March 2020
      770 pages
      ISSN:0360-0300
      EISSN:1557-7341
      DOI:10.1145/3320149
      Editor: Sartaj Sahni
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 27 March 2019
      Accepted: 01 December 2018
      Revised: 01 November 2018
      Received: 01 August 2018
      Published in CSUR Volume 52, Issue 2

      Author Tags

      1. Soft error
      2. redundant multithreading
      3. thread-level redundancy

      Qualifiers

      • Survey
      • Research
      • Refereed

      Bibliometrics

      Article Metrics

      • Downloads (last 12 months): 101
      • Downloads (last 6 weeks): 5
      Reflects downloads up to 15 Oct 2024

      Cited By

      • (2024) ReIPE: Recycling Idle PEs in CNN Accelerator for Vulnerable Filters Soft-Error Detection. ACM Transactions on Architecture and Code Optimization 21:3 (1-26). DOI: 10.1145/3674909. Online publication date: 28-Jun-2024
      • (2024) Survey on Redundancy Based-Fault tolerance methods for Processors and Hardware accelerators - Trends in Quantum Computing, Heterogeneous Systems and Reliability. ACM Computing Surveys 56:11 (1-76). DOI: 10.1145/3663672. Online publication date: 28-Jun-2024
      • (2024) Compiler-Managed Replication of CUDA Kernels for Reliable Execution of GPGPU Applications. Journal of Circuits, Systems and Computers 33:14. DOI: 10.1142/S0218126624502542. Online publication date: 18-Apr-2024
      • (2024) Dynamic Triple Modular Redundancy in Interleaved Hardware Threads: An Alternative Solution to Lockstep Multi-Cores for Fault-Tolerant Systems. IEEE Access 12 (95720-95735). DOI: 10.1109/ACCESS.2024.3425579. Online publication date: 2024
      • (2024) A RISC-V Fault-Tolerant Soft-Processor Based on Full/Partial Heterogeneous Dual-Core Protection. IEEE Access 12 (30495-30506). DOI: 10.1109/ACCESS.2024.3366806. Online publication date: 2024
      • (2024) Error-Tolerant Techniques for Classifiers Beyond Neural Networks for Dependable Machine Learning. Design and Applications of Emerging Computer Systems (185-207). DOI: 10.1007/978-3-031-42478-6_7. Online publication date: 14-Jan-2024
      • (2023) Bare-Metal Redundant Multi-Threading on Multicore SoCs Under Neutron Irradiation. IEEE Transactions on Nuclear Science 70:8 (1643-1651). DOI: 10.1109/TNS.2023.3247129. Online publication date: Aug-2023
      • (2022) Evaluation of Dynamic Triple Modular Redundancy in an Interleaved-Multi-Threading RISC-V Core. Journal of Low Power Electronics and Applications 13:1 (2). DOI: 10.3390/jlpea13010002. Online publication date: 28-Dec-2022
      • (2022) Hybrid Lockstep Technique for Soft Error Mitigation. IEEE Transactions on Nuclear Science 69:7 (1574-1581). DOI: 10.1109/TNS.2022.3149867. Online publication date: Jul-2022
      • (2022) Selective Neuron Re-Computation (SNRC) for Error-Tolerant Neural Networks. IEEE Transactions on Computers 71:3 (684-695). DOI: 10.1109/TC.2021.3056992. Online publication date: 1-Mar-2022
