Abstract
Hardware reliability is now considered a first-class design concern, alongside performance and energy efficiency. Continued technology scaling and the accompanying supply-voltage reductions, together with temperature fluctuations, increase the susceptibility of architectures to errors.
With the development of chip multiprocessors (CMPs), interest in parallel applications has increased. Previous proposals for fault detection and recovery have been based mainly on redundant execution across different cores. Redundant Multi-Threading (RMT) is a family of techniques built on Simultaneous Multi-Threading (SMT) processors, in which two independent threads (master and slave), fed with the same inputs, redundantly execute the same instructions so that faults can be detected by comparing their outputs. In this paper, we study the under-explored architectural support that RMT techniques need to reliably execute shared-memory applications in tiled-CMPs.
First, we show how atomic operations induce serialization points between master and slave threads, increasing execution time by 35% for several parallel scientific and multimedia benchmarks. To address this issue, we introduce REPAS (Reliable Execution of Parallel ApplicationS in tiled-CMPs), a novel RMT mechanism that provides reliable execution of shared-memory applications in environments prone to transient faults. The REPAS architecture needs little extra hardware, since the redundant execution is performed within 2-way SMT cores in which most of the hardware is shared. Experimental results show that REPAS provides fault tolerance against soft errors with a lower execution-time overhead than previous proposals (around 25% relative to a non-redundant system, including the cost of redundancy) while using fewer hardware resources. Additionally, we show that REPAS tolerates very high fault ratios with negligible impact on performance (less than 2% for a fault ratio of 100 faults per million cycles).
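The master/slave idea behind RMT can be illustrated with a small software analogy. The sketch below is not the REPAS hardware mechanism; it only shows the principle of executing the same operation twice on the same input, comparing the two outputs, and re-executing when they disagree. All names (`redundant_execute`, `flip_lowest_bit`, `max_retries`) are hypothetical and chosen for illustration.

```python
from typing import Callable, Optional, Tuple


def flip_lowest_bit(x: int) -> int:
    """Hypothetical fault model: a single bit flip in the result."""
    return x ^ 1


def redundant_execute(
    op: Callable[[int], int],
    value: int,
    inject_fault: Optional[Callable[[int], int]] = None,
    max_retries: int = 3,
) -> Tuple[int, int]:
    """Run `op` as a master copy and a slave copy on the same input.

    Returns (result, detected_faults). A mismatch between the two
    outputs counts as a detected fault and triggers re-execution.
    """
    detected = 0
    for _ in range(max_retries + 1):
        master_out = op(value)
        slave_out = op(value)
        if inject_fault is not None:
            # Corrupt the slave output once to mimic a transient fault.
            slave_out = inject_fault(slave_out)
            inject_fault = None
        if master_out == slave_out:
            return master_out, detected
        detected += 1
    raise RuntimeError("outputs still disagree after all retries")


if __name__ == "__main__":
    print(redundant_execute(lambda x: x * x, 7))
    print(redundant_execute(lambda x: x * x, 7, inject_fault=flip_lowest_bit))
```

In this toy model the fault is transient, so a single re-execution is enough for the two copies to agree again. A real RMT design performs the comparison in hardware on the outputs of the two threads rather than on function return values.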
Cite this article
Sánchez, D., Aragón, J.L. & García, J.M. A fault-tolerant architecture for parallel applications in tiled-CMPs. J Supercomput 61, 997–1023 (2012). https://doi.org/10.1007/s11227-011-0670-9
DOI: https://doi.org/10.1007/s11227-011-0670-9