survey

Predictive Reliability and Fault Management in Exascale Systems: State of the Art and Perspectives

Authors: Carles Hernandez, Alessandro Cilardo, Giuseppe Massari, Federico Reghenzani, William Fornaciari, Marina Zapater, Ariel Oleksiak, Wojciech Piątek, Jaume Abella

ACM Computing Surveys (CSUR), Volume 53, Issue 5

Article No.: 95, Pages 1 - 32

https://doi.org/10.1145/3403956

Published: 28 September 2020

Abstract

Performance and power constraints come together with Complementary Metal-Oxide-Semiconductor (CMOS) technology scaling in future Exascale systems. Technology scaling makes each individual transistor more prone to faults and, because the number of devices per chip grows exponentially, raises system fault rates. Consequently, High-Performance Computing (HPC) systems need to integrate prediction, detection, and recovery mechanisms to cope with faults efficiently. This article reviews fault detection, fault prediction, and recovery techniques in HPC systems, from the electronics level to the system level. We analyze their strengths and limitations. Finally, we identify promising paths to meet the reliability levels of Exascale systems.

References

[1]

J. Abella, P. Chaparro, X. Vera, J. Carretero, and A. González. 2008. On-line failure detection and confinement in caches. In Proceedings of the 14th IEEE International On-line Testing Symposium. 3--9.

[2]

J. Abella, C. Hernandez, E. Quiñones, F. J. Cazorla, P. R. Conmy, M. Azkarate-askasua, J. Perez, E. Mezzetti, and T. Vardanega. 2015. WCET analysis methods: Pitfalls and challenges on their trustworthiness. In Proceedings of the 10th IEEE International Symposium on Industrial Embedded Systems (SIES’15). 1--10.

[3]

E. Agullo, L. Giraud, A. Guermouche, J. Roman, and M. Zounon. 2013. Towards resilient parallel linear Krylov solvers: Recover-restart strategies. INRIA, Research Report RR-8324.

[4]

E. Agullo, L. Giraud, P. Salas, and M. Zounon. 2016. Interpolation-restart strategies for resilient eigensolvers. SIAM J. Sci. Comput. 38, 5 (2016), C560--C583. arXiv:https://doi.org/10.1137/15M1042115

Digital Library

[5]

M. Al-Fares, A. Loukissas, and A. Vahdat. 2008. A scalable, commodity data center network architecture. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM’08). ACM, New York, NY, 63--74.

[6]

A. M. Al-Qawasmeh, S. Pasricha, A. A. Maciejewski, and H. J. Siegel. 2015. Power and thermal-aware workload allocation in heterogeneous data centers. IEEE Trans. Comput. 64, 2 (Feb. 2015), 477--491.

[7]

R. Alverson, D. Roweth, and L. Kaplan. 2010. The Gemini system interconnect. In Proceedings of the 18th IEEE Symposium on High Performance Interconnects. 83--87.

[8]

A. Andrzejak and L. Silva. 2007. Deterministic models of software aging and optimal rejuvenation schedules. In Proceedings of the 10th IFIP/IEEE International Symposium on Integrated Network Management. 159--168.

[9]

ARM. 2017. ARM Reliability, Availability, and Serviceability (RAS) Specification—ARMv8, for the ARMv8-A Architecture Profile. White paper. Retrieved from https://developer.arm.com/docs/ddi0587/latest.

[10]

A. Avizienis. 1995. Software Fault Tolerance. Chapter 2: The Methodology of N-Version Programming. Wiley.

[11]

A. Avizienis, J. C. Laprie, B. Randell, and C. Landwehr. 2004. Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. Depend. Secure Comput. 1, 1 (Jan. 2004), 11--33.

Digital Library

[12]

J. Bachan, S. B. Baden, S. Hofmeyr, M. Jacquelin, A. Kamil, D. Bonachea, P. H. Hargrove, and H. Ahmed. 2019. UPC++: A high-performance communication framework for asynchronous computation. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS’19). 963--973.

[13]

M. Bauer, S. Treichler, E. Slaughter, and A. Aiken. 2012. Legion: Expressing locality and independence with logical regions. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC’12). 1--11.

[14]

L. Bautista-Gomez, F. Zyulkyarov, O. Unsal, and S. McIntosh-Smith. 2016. Unprotected computing: A large-scale study of DRAM raw error rate on a supercomputer. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’16). 645--655.

[15]

H. R. Berenji, J. Ametha, and D. Vengerov. 2003. Inductive learning for fault diagnosis. In Proceedings of the 12th IEEE International Conference on Fuzzy Systems (FUZZ’03), Vol. 1. 726--731 vol.1.

[16]

D. Bernick, B. Bruckert, P. D. Vigna, D. Garcia, R. Jardine, J. Klecka, and J. Smullen. 2005. NonStop/spl reg/ advanced architecture. In Proceedings of the International Conference on Dependable Systems and Networks (DSN’05). 12--21.

[17]

E. Berrocal, L. Bautista-Gomez, S. Di, Z. Lan, and F. Cappello. 2015. Lightweight silent data corruption detection based on runtime data analysis for HPC applications. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC’15). Association for Computing Machinery, New York, NY, 275--278.

[18]

E. Berrocal, L. Bautista-Gomez, S. Di, Z. Lan, and F. Cappello. 2017. Toward general software level silent data corruption detection for parallel applications. IEEE Trans. Parallel Distrib. Syst. 28, 12 (Dec. 2017), 3642--3655.

[19]

M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien. 2011. Checkpointing strategies for parallel jobs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’11). ACM, New York, NY, Article 33, 11 pages.

[20]

J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong. 2010. Quantifying effectiveness of failure prediction and response in HPC systems: Methodology and example. In Proceedings of the International Conference on Dependable Systems and Networks Workshops (DSN-W’10). 2--7.

[21]

M.-A. Breuer and A. D. Friedman. 1976. Diagnosis 8 Reliable Design of Digital Systems. Springer.

[22]

P. Bridges, K. Ferreira, M. Heroux, and M. Hoemmen. 2012. Fault-tolerant linear solvers via selective reliability. ArXiv e-prints, June 2012. arXiv:1206.1390 [math.NA].

[23]

G. Bronevetsky and B. De Supinski. 2008. Soft error vulnerability of iterative linear algebra methods. In Proceedings of the 22nd Annual International Conference on Supercomputing (ICS’08). ACM, New York, NY, 155--164.

[24]

U. Cabello, J. Rodríguez, A. Meneses, S. Mendoza, and D. Decouchant. 2014. Fault tolerance in heterogeneous multi-cluster systems through a task migration mechanism. In Proceedings of the 11th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE’14). IEEE, 1--7.

[25]

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer, and M. Snir. 2014. Toward exascale resilience: 2014 update. Supercomput. Front. Innovat. 1, 1 (2014). http://superfri.org/superfri/article/view/14.

[26]

M. Casas, B. de Supinski, G. Bronevetsky, and M. Schulz. 2012. Fault resilience of the algebraic multi-grid solver. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS’12). ACM, New York, NY, 91--100.

[27]

F. J. Cazorla, L. Kosmidis, E. Mezzetti, C. Hernandez, J. Abella, and T. Vardanega. 2019. Probabilistic worst-case timing analysis: Taxonomy and comprehensive survey. ACM Comput. Surv. 52, 1, Article 14 (Feb. 2019), 35 pages.

[28]

S. Cha, C. Chen, and L. S. Milor. 2014. System-level estimation of threshold voltage degradation due to NBTI with I/O measurements. In Proceedings of the IEEE International Reliability Physics Symposium. PR.1.1--PR.1.7.

[29]

C. S. Chan, Y. Jin, Y. K. Wu, K. Gross, K. Vaidyanathan, and T. S. Rosing. 2012. Fan-speed-aware scheduling of data intensive jobs. In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED’12). Association for Computing Machinery, New York, NY, 409--414.

[30]

C. S. Chan, B. Pan, K. Gross, K. Vaidyanathan, and T. S. Rosing. 2014. Correcting vibration-induced performance degradation in enterprise servers. SIGMETRICS Perform. Eval. Rev. 41, 3 (Jan. 2014), 83--88.

Digital Library

[31]

T. Chantem, X. S. Hu, and R. P. Dick. 2011. Temperature-aware scheduling and assignment for hard real-time applications on MPSoCs. IEEE Trans. Very Large Scale Integr. Syst. 19, 10 (Oct 2011), 1884--1897.

Digital Library

[32]

P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. 2005. X10: An object-oriented approach to non-uniform cluster computing. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA’05). Association for Computing Machinery, New York, NY, 519--538.

[33]

M. Y. Chen, A. Accardi, E. Kiciman, J. Lloyd, D. Patterson, A. Fox, and E. Brewer. 2004. Path-based faliure and evolution management. In Proceedings of the 1st Conference on Symposium on Networked Systems Design and Implementation (NSDI’04). USENIX Association, Berkeley, CA, 23--23. http://dl.acm.org/citation.cfm?id=1251175.1251198

[34]

M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer. 2002. Pinpoint: Problem determination in large, dynamic Internet services. In Proceedings International Conference on Dependable Systems and Networks. 595--604.

[35]

Z. Chen. 2011. Algorithm-based recovery for iterative methods without checkpointing. In Proceedings of the 20th International Symposium on High Performance Distributed Computing (HPDC’11). ACM, New York, NY, 73--84.

Digital Library

[36]

Z. Chen. 2013. Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’13). ACM, New York, NY, 167--176.

Digital Library

[37]

J. Choi, C. Y. Cher, H. Franke, H. Hamann, A. Weger, and P. Bose. 2007. Thermal-aware task scheduling at the system software level. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED’07). ACM, New York, NY, 213--218.

[38]

A. K. Coskun, T. S. Rosing, and K. C. Gross. 2008. Temperature management in multiprocessor socs using online learning. In Proceedings of the 45th Annual Design Automation Conference (DAC’08). ACM, New York, NY, 890--893.

[39]

A. K. Coskun, T. S. Rosing, K. Mihic, G. De Micheli, and Y. Leblebici. 2006. Analysis and optimization of MPSoC reliability. J. Low Power Electron. 2, 1 (2006), 56--69.

[40]

G. Da Costa, A. Oleksiak, W. Piatek, J. Salom, and L. Sisó. 2015. Minimization of costs and energy consumption in a data center by a workload-based capacity management. In Energy Efficient Data Centers, S. Klingert, M. Chinnici, and M. Rey Porto (Eds.). Springer International Publishing, Cham, 102--119.

[41]

IEEE IRDS Technical Council. 2018. International Roadmap for Devices and Systems. IEEE.

[42]

L. Cupertino, G. Da Costa, A. Oleksiak, W. Piatek, J.M. Pierson, J. Salom, L. Siso, P. Stolf, H. Sun, and T. Zilio. 2015. Energy-efficient, thermal-aware modeling and simulation of data centers: The CoolEmAll approach and evaluation results. Ad Hoc Netw. 25 (2015), 535--553.

Digital Library

[43]

P. Czarnul, J. Proficz, and A. Krzywaniak. 2019. Energy-aware high-performance computing: Survey of state-of-the-art tools, techniques, and environments. Sci. Program. 2019, 4 (2019), 8348791.

[44]

W. J. Dally. 1991. Express cubes: Improving the performance of k-ary n-cube interconnection networks. IEEE Trans. Comput. 40, 9 (Sep. 1991), 1016--1023.

Digital Library

[45]

D. Dasari, B. Akesson, V. Nélis, M. A. Awan, and S. M. Petters. 2013. Identifying the sources of unpredictability in COTS-based multicore systems. In Proceedings of the 8th IEEE International Symposium on Industrial Embedded Systems (SIES’13). 39--48.

[46]

D. Dauwe, R. Jhaveri, S. Pasricha, A. A. Maciejewski, and H. J. Siegel. 2018. Optimizing checkpoint intervals for reduced energy use in exascale systems. In Proceedings of the 8th International Green and Sustainable Computing Conference (IGSC’17). 1--8.

[47]

D. Dauwe, S. Pasricha, A. A. Maciejewski, and H. J. Siegel. 2017. An analysis of resilience techniques for exascale computing platforms. Proceedings of the IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017. 914--923.

[48]

D. Dauwe, S. Pasricha, A. A. Maciejewski, and H. J. Siegel. 2018. An analysis of multilevel checkpoint performance models. Proceedings of the IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018. 783--792.

[49]

D. Dauwe, S. Pasricha, A. A. Maciejewski, and H. J. Siegel. 2018. Resilience-aware resource management for exascale computing systems. IEEE Trans. Sustain. Comput. 3, 4 (2018), 332--345.

[50]

T. Davies and Z. Chen. 2013. Correcting soft errors online in LU factorization. In Proceedings of the 22nd International Symposium on High-performance Parallel and Distributed Computing (HPDC’13). ACM, New York, NY, 167--178.

[51]

R. I. Davis and A. Burns. 2011. A survey of hard real-time scheduling for multiprocessor systems. ACM Comput. Surv. 43, 4, Article 35 (Oct. 2011), 44 pages.

[52]

Narayan Desai. 2005. Cobalt: An open source platform for HPC system software research. Edinburgh BG/L System Software Workshop.

[53]

S. Di, L. Bautista-Gome, and F. Cappello. 2014. Optimization of a multilevel checkpoint model with uncertain execution scales. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’14). 907--918.

[54]

S. Di, E. Berrocal, and F. Cappello. 2015. An efficient silent data corruption detection method with error-feedback control and even sampling for HPC applications. In Proceedings of the 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. 271--280.

[55]

S. Di, M. S. Bouguerra, L. Bautista-Gomez, and F. Cappello. 2014. Optimization of multi-level checkpoint model for large scale HPC applications. In Proceedings of the IEEE 28th International Parallel and Distributed Processing Symposium. 1181--1190.

[56]

S. Di and F. Cappello. 2016. Adaptive impact-driven detection of silent data corruption for HPC applications. IEEE Trans. Parallel Distrib. Syst. 27, 10 (Oct. 2016), 2809--2823.

Digital Library

[57]

S. Di, H. Guo, R. Gupta, E. R. Pershey, M. Snir, and F. Cappello. 2019. Exploring properties and correlations of fatal events in a large-scale HPC system. IEEE Trans. Parallel Distrib. Syst. 30, 2 (Feb. 2019), 361--374.

Digital Library

[58]

S. Di, H. Guo, E. Pershey, M. Snir, and F. Cappello. 2019. Characterizing and understanding HPC job failures over the 2K-day life of IBM BlueGene/Q system. In Proceedings of the 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’19). 473--484.

[59]

S. Di, Y. Robert, F. Vivien, and F. Cappello. 2017. Toward an optimal online checkpoint solution under a two-level HPC checkpoint model. IEEE Trans. Parallel Distrib. Syst. 28, 1 (Jan. 2017), 244--259.

Digital Library

[60]

C. Di-Martino, Z. Kalbarczyk, R. K. Iyer, F. Baccanico, J. Fullop, and W. Kramer. 2014. Lessons learned from the analysis of system failures at petascale: The case of blue waters. In Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 610--621.

[61]

P. Dinda, X. Wang, J. Wang, C. Beauchene, and C. Hetland. 2018. Hard real-time scheduling for parallel run-time systems. In Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing (HPDC’18). ACM, New York, NY, 14--26.

[62]

J. Domke, T. Hoefler, and S. Matsuoka. 2014. Fail-in-place network design: Interaction between topology, routing algorithm and failures. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’14). 597--608.

[63]

J. Dongarra, T. Herault, and Y. Robert. 2015. Fault Tolerance Techniques for High-Performance Computing. Springer.

[64]

S. D. Downing and D. F. Socie. 1982. Simple rainflow counting algorithms. Int. J. Fatigue 4, 1 (1982), 31--40.

[65]

B. Eghbalkhah, M. Kamal, H. Afzali-Kusha, A. Afzali-Kusha, M. B. Ghaznavi-Ghoushchi, and M. Pedram. 2015. Workload and temperature dependent evaluation of BTI-induced lifetime degradation in digital circuits. Microelectron. Reliabil. 55, 8 (2015), 1152--1162.

[66]

J. Elliott, M. Hoemmen, and F. Mueller. 2014. Evaluating the impact of SDC on the GMRES iterative solver. In Proceedings of the IEEE 28th International Parallel and Distributed Processing Symposium. 1193--1202.

[67]

R. Entner. 2007. Modeling and Simulation of Negative Bias Temperature Instability. Ph.D. Dissertation. Technische Universitaet Wien, Institut fur Mikroelektronik.

[68]

D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira, and R. Brightwell. 2012. Detection and correction of silent data corruption for large-scale high-performance computing. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC’12). 1--12.

[69]

J. Flich, G. Agosta, P. Ampletzer, D. A. Alonso, C. Brandolese, E. Cappe, A. Cilardo, L. Dragic, A. Dray, A. Duspara, W. Fornaciari, G. Guillaume, Y. Hoornenborg, A. Iranfar, M. Kovac, S. Libutti, B. Maitre, J. M. Martinez, G. Massari, H. Mlinaric, E. Papastefanakis, T. Picornell, I. Piljic, A. Pupykina, F. Reghenzani, I. Staub, R. Tornero, M. Zapater, and D. Zoni. 2017. MANGO: Exploring manycore architectures for next-GeneratiOn HPC systems. In Proceedings of the Euromicro Conference on Digital System Design (DSD’17). 478--485.

[70]

W. Fornaciari, G. Agosta, D. Atienza, C. Brandolese, L. Cammoun, L. Cremona, A. Cilardo, A. Farres, J. Flich, C. Hernandez, M. Kulchewski, S. Libutti, J.M. Martinez, G. Massari, A. Oleksiak, A. Pupykina, F. Reghenzani, R. Tornero, M. Zanella, M. Zapater, and D. Zoni. 2018. Reliable power and time-constraints-aware predictive management of heterogeneous exascale systems. In Proceedings of the 18th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS’18). ACM, New York, NY, 187--194.

[71]

S. Fu and C. Xu. 2007. Quantifying temporal and spatial correlation of failure events for proactive management. In Proceedings of the 26th IEEE International Symposium on Reliable Distributed Systems (SRDS’07). 175--184.

[72]

E. W. Fulp, G. A. Fink, and J. N. Haack. 2008. Predicting computer system failures using support vector machines. In Proceedings of the First USENIX Conference on Analysis of System Logs (WASL’08). USENIX Association, Berkeley, CA, 5--5. Retrieved from http://dl.acm.org/citation.cfm?id=1855886.1855891.

[73]

A. Gainaru, F. Cappello, M. Snir, and W. Kramer. 2012. Fault prediction under the microscope: A closer look into HPC systems. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC’12). 1--11.

[74]

R. Garg, A. Mohan, M. Sullivan, and G. Cooperman. 2018. CRUM: Checkpoint-restart support for CUDA’s unified memory. Proceedings of the IEEE International Conference on Cluster Computing (ICCC’18). 302--313.

[75]

S. Garg, A. van Moorsel, K. Vaidyanathan, and K. S. Trivedi. 1998. A methodology for detection and estimation of software aging. In Proceedings of the 9th International Symposium on Software Reliability Engineering. 283--292.

[76]

P. Gill, N. Jain, and N. Nagappan. 2011. Understanding network failures in data centers: Measurement, analysis, and implications. In Proceedings of the ACM SIGCOMM Conference (SIGCOMM’11). ACM, New York, NY, 350--361.

[77]

R. Gioiosa, J. C. Sancho, S. Jiang, and F. Petrini. 2005. Transparent, incremental checkpointing at kernel level: A foundation for fault tolerance for parallel computers. In Proceedings of the ACM/IEEE Conference on Supercomputing (SC’05). 9--9.

[78]

M. Gottscho, M. Shoaib, S. Govindan, B. Sharma, D. Wang, and P. Gupta. 2017. Measuring the impact of memory errors on application performance. IEEE Comput. Architect. Lett. 16, 1 (Jan. 2017), 51--55.

[79]

A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. 2011. VL2: A scalable and flexible data center network. Commun. ACM 54, 3 (Mar. 2011), 95--104.

Digital Library

[80]

C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu. 2009. BCube: A high performance, server-centric network architecture for modular data centers. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM’09). ACM, New York, NY, 63--74.

[81]

G. Hamerly and C. Elkan. 2001. Bayesian approaches to failure prediction for disk drives. In Proceedings of the 18th International Conference on Machine Learning (ICML’01). Morgan Kaufmann Publishers Inc., San Francisco, CA, 202--209. http://dl.acm.org/citation.cfm?id=645530.655825

[82]

S. Hamouda. 2019. Resilience in High-Level Parallel Programming Languages. Ph.D. Dissertation. The Australian National University.

[83]

J. L. Hellerstein, F. Zhang, and P. Shahabuddin. 1999. An approach to predictive detection for service management. In Proceedings of the 6th IFIP/IEEE International Symposium on Integrated Network Management. 309--322.

[84]

R. L. Henderson. 1995. Job scheduling under the portable batch system. In Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing (IPPS’95). Springer-Verlag, Berlin, 279--294.

Digital Library

[85]

M. A. Heroux, R. A. Bartlett, V. E. Howle, R. J. Hoekstra, J. J. Hu, T. G. Kolda, R. B. Lehoucq, K. R. Long, R. P. Pawlowski, E. T. Phipps, A. G. Salinger, H. K. Thornquist, R. S. Tuminaro, J. M. Willenbring, A. Williams, and K. S. Stanley. 2005. An overview of the trilinos project. ACM Trans. Math. Softw. 31, 3 (Sept. 2005), 397--423.

Digital Library

[86]

G. A. Hoffmann. 2006. Failure Prediction in Complex Computer Systems: A Probabilistic Approach. Shaker Verlag.

[87]

G. A. Hoffmann, K. S. Trivedi, and M. Malek. 2007. A best practice guide to resource forecasting for computing systems. IEEE Trans. Reliabil. 56, 4 (Dec. 2007), 615--628.

[88]

M. Y. Hsiao, W. C. Carter, J. W. Thomas, and W. R. Stringfellow. 1981. Reliability, availability, and serviceability of IBM computer systems: A quarter century of progress. IBM J. Res. Dev. 25, 5 (Sept. 1981), 453--468.

Digital Library

[89]

K.-H. Huang and J. A. Abraham. 1984. Algorithm-based fault tolerance for matrix operations. IEEE Trans. Comput. C-33, 6 (June 1984), 518--528.

[90]

W. Huang, M. R. Stan, K. Skadron, K. Sankaranarayanan, S. Ghosh, and S. Velusamy. 2004. Compact thermal modeling for temperature-aware design. In Proceedings of the 41st Design Automation Conference, 2004. 878--883.

Digital Library

[91]

G. F. Hughes, J. F. Murray, K. Kreutz-Delgado, and C. Elkan. 2002. Improved disk-drive failure warnings. IEEE Trans. Reliabil. 51, 3 (Sep. 2002), 350--357.

[92]

S. Hukerikar, P. C. Diniz, R. F. Lucas, and K. Teranishi. 2014. Opportunistic application-level fault detection through adaptive redundant multithreading. In Proceedings of the International Conference on High Performance Computing Simulation (HPCS’14). 243--250.

[93]

S. Hukerikar and C. Engelmann. 2017. Resilience design patterns: A structured approach to resilience at extreme scale. Supercomput. Front. Innov. 4, 3 (2017).

[94]

H. Hussain, S. Malik, A. Hameed, S. U. Khan, G. Bickler, N. Min-Allah, M. B. Qureshi, L. Zhang, W. Yongji, N. Ghani, J. Kolodziej, A. Y. Zomaya, C. Z. Xu, P. Balaji, A. Vishnu, F. Pinel, J. E. Pecero, D. Kliazovich, P. Bouvry, H. Li, L. Wang, D. Chen, and A. Rayes. 2013. A survey on resource allocation in high performance distributed computing systems. Parallel Comput. 39, 11 (2013), 709--736.

Digital Library

[95]

Z. Hussain, T. Znati, and R. Melhem. 2018. Partial redundancy in HPC systems with non-uniform node reliabilities. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC’18). IEEE Press, Piscataway, NJ, Article 44, 11 pages. http://dl.acm.org/citation.cfm?id=3291656.3291715

[96]

Intel Corporation. [n.d.]. Intel Xeon Processor E7 Family: Reliability, Availability, and Serviceability. White paper. https://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-family-ras-server-paper.html.

[97]

ISO/DIS 26262: Road Vehicles—Functional Safety. 2011. International Organization for Standardization.

[98]

A. Iranfar, M. Kamal, A. Afzali-Kusha, M. Pedram, and D. Atienza. 2018. TheSPoT: Thermal stress-aware power and temperature management for multiprocessor systems-on-chip. IEEE Trans. Comput.-Aid. Design Integr. Circ. Syst. 37, 8 (Aug. 2018), 1532--1545.

[99]

A. Iranfar, F. Terraneo, W. A. Simon, L. Dragić, I. Piljić, M. Zapater, W. Fornaciari, M. Kovač, and D. Atienza. 2017. Thermal characterization of next-generation workloads on heterogeneous MPSoCs. In Proceedings of the International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS’17). 286--291.

[100]

S. Jha, V. Formicola, C. Di Martino, M. A. Dalton, W. T. Kramer, Z. T. Kalbarczyk, and R. K. Iyer. 2017. Resiliency of HPC interconnects: A case study of interconnect failures and recovery in blue waters. IEEE Trans. Depend. Secure Comput. 15 (2017), 915--930.

Digital Library

[101]

M. Jin, C. Liu, J. Kim, J. Kim, H. Shim, K. Kim, G. Kim, S. Lee, T. Uemura, M. Chang, T. An, J. Park, and S. Pae. 2016. Reliability characterization of 10 nm FinFET technology with multi-VT gate stack for low power and high performance. In Proceedings of the IEEE International Electron Devices Meeting (IEDM’16). 15.1.1--15.1.4.

[102]

S. Kannan, N. Farooqui, A. Gavrilovska, and K. Schwan. 2014. HeteroCheckpoint: Efficient checkpointing for accelerator-based systems. In Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 738--743.

[103]

E. Kiciman and A. Fox. 2005. Detecting application-level failures in component-based Internet services. IEEE Trans. Neural Netw. 16, 5 (Sep. 2005), 1027--1041.

Digital Library

[104]

T. Kim, Z. Sun, C. Cook, H. Zhao, R. Li, D. Wong, and S. X. . Tan. 2016. Invited: Cross-layer modeling and optimization for electromigration induced reliability. In Proceedings of the 53rd ACM/EDAC/IEEE Design Automation Conference (DAC’16). 1--6.

Digital Library

[105]

M. Kumar, S. Gupta, T. Patel, M. Wilder, W. Shi, S. Fu, C. Engelmann, and D. Tiwari. 2018. Understanding and analyzing interconnect errors and network congestion on a large scale HPC system. In Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’18). 107--114.

[106]

K. Kurowski, A. Oleksiak, W. Piątek, T. Piontek, A. Przybyszewski, and J. Węglarz. 2013. DCworms -- A tool for simulation of energy efficiency in distributed computing infrastructures. Simul. Model. Pract. Theory 39 (2013), 135--151. S. I. Energy efficiency in grids and clouds.

[107]

R. Lal and G. Choi. 1998. Error and failure analysis of a UNIX server. In Proceedings of the 3rd IEEE International High-Assurance Systems Engineering Symposium. 232--239.

[108]

J. Langou, Z. Chen, G. Bosilca, and J. Dongarra. 2008. Recovery patterns for iterative methods in a parallel unstable environment. SIAM J. Sci. Comput. 30, 1 (2008), 102--116. Retrieved from arXiv:https://doi.org/10.1137/040620394

Digital Library

[109]

J. C. Laprie (Ed.). 1995. Dependability—Its Attributes, Impairments and Means. Springer-Verlag, Berlin.

[110]

J. C. Laprie. 1995. Dependable computing and fault tolerance: Concepts and terminology. In Proceedings of the 25th International Symposium on Fault-Tolerant Computing.

[111]

J. C. C. Laprie, A. Avizienis, and H. Kopetz (Eds.). 1992. Dependability: Basic Concepts and Terminology. Springer-Verlag, Berlin.

Digital Library

[112]

C. J. M. Lasance. 2003. Thermally driven reliability issues in microelectronic systems: Status-quo and challenges. Microelectron. Reliabil. 43, 12 (2003), 1969--1974.

[113]

C. E. Leiserson. 1985. Fat-trees: Universal networks for hardware-efficient supercomputing. IEEE Trans. Comput. 34, 10 (Oct. 1985), 892--901. http://dl.acm.org/citation.cfm?id=4492.4495

Digital Library

[114]

D. Levy and R. Chillarege. 2003. Early warning of failures through alarm analysis a case study in telecom voice mail systems. In Proceedings of the 14th International Symposium on Software Reliability Engineering (ISSRE’03) 271--280.

[115]

X. Liang, J. Chen, D. Tao, S. Li, P. Wu, H. Li, K. Ouyang, Y. Liu, F. Song, and Z. Chen. 2017. Correcting soft errors online in fast Fourier transform. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’17). ACM, New York, NY, Article 30, 12 pages.

[116]

Y. Liang, Y. Zhang, A. Sivasubramaniam, M. Jette, and R. Sahoo. 2006. BlueGene/L failure analysis and prediction models. In Proceedings of the International Conference on Dependable Systems and Networks (DSN’06). 425--434.

Digital Library

[117]

T. T. Y. Lin and D. P. Siewiorek. 1990. Error log analysis: Statistical modeling and heuristic trend analysis. IEEE Trans. Reliabil. 39, 4 (Oct. 1990), 419--432.

[118]

N. Losada, P. González, M. J. Martín, G. Bosilca, A. Bouteiller, and K. Teranishi. 2020. Fault tolerance of MPI applications in exascale systems: The ULFM solution. Future Gen. Comput. Syst. 106 (2020), 467--481.

[119]

R. E. Lyons and W. Vanderkulk. 1962. The use of triple-modular redundancy to improve computer reliability. IBM J. Res. Dev. 6, 2 (Apr. 1962), 200--209.

Digital Library

[120]

C. D. Martino, W. Kramer, Z. Kalbarczyk, and R. Iyer. 2015. Measuring and understanding extreme-scale application resilience: A field study of 5,000,000 HPC application runs. In Proceedings of the 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 25--36.

[121]

M. Médard and S. S. Lumetta. 2003. Network Reliability and Fault Tolerance. American Cancer Society. Retrieved from arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/0471219282.eot281.

[122]

J. Meza, Q. Wu, S. Kumar, and O. Mutlu. 2015. Revisiting memory errors in large-scale production data centers: Analysis and modeling of new trends from the field. In Proceedings of the 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 415--426.

[123]

A. Moody, G. Bronevetsky, K. Mohror, and B. R. d. Supinski. 2010. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’10). 1--11.

[124]

Moor Insights 8 Strategy. 2017. AMD EPYC Brings New RAS Capability. White paper. Retrieved from https://www.amd.com/system/files/2017-06/AMD-EPYC-Brings-New-RAS-Capability.pdf.

[125]

F. Mulas, D. Atienza, A. Acquaviva, S. Carta, L. Benini, and G. De Micheli. 2009. Thermal balancing policy for multiprocessor stream computing platforms. IEEE Trans. Comput.-Aid. Design Integr. Circ. Syst. 28, 12 (Dec. 2009), 1870--1882.

Digital Library

[126]

J. Murray, G. Hugues, and K. Kreutz-Delgado. 2003. Hard drive failure prediction using non-parametric statistical methods. In Proceedings of the Conference on Artificial Neural Networks and Neural Information Processing (ICANN/ICONIP’03).

[127]

B. Narasimham, S. Gupta, D. Reed, J. K. Wang, N. Hendrickson, and H. Taufique. 2018. Scaling trends and bias dependence of the soft error rate of 16 nm and 7 nm FinFET SRAMs. In Proceedings of the IEEE International Reliability Physics Symposium (IRPS’18). 4C.1--1--4C.1--4.

[128]

S. Narayanasamy, A. K. Coskun, and B. Calder. 2007. Transient fault prediction based on anomalies in processor events. In Proceedings of the Design, Automation Test in Europe Conference Exhibition. 1--6.

[129]

F. A. Nassar and D. M. Andrews. 1985. A methodology for analysis of failure prediction data. In Proceedings of the 6th IEEE Real-Time Systems Symposium (RTSS’85). 160--166.

[130]

A. Nukada, H. Takizawa, and S. Matsuoka. 2011. NVCR: A transparent checkpoint-restart library for NVIDIA CUDA. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum. 104--113.

[131]

A. Oleksiak, M. Kierzynka, W. Piatek, G. Agosta, A. Barenghi, C. Brandolese, W. Fornaciari, G. Pelosi, M. Cecowski, R. Plestenjak, J. Činkelj, M. Porrmann, J. Hagemeyer, R. Griessl, J. Lachmair, M. Peykanu, L. Tigges, M. vor dem Berge, W. Christmann, S. Krupop, A. Carbon, L. Cudennec, T. Goubier, J. M. Philippe, S. Rosinger, D. Schlitt, C. Pieper, C. Adeniyi-Jones, J. Setoain, L. Ceva, and U. Janssen. 2017. M2DC—Modular Microserver DataCentre with heterogeneous hardware. Microprocess. Microsyst. 52 (2017), 117--130.

[132]

D. A. G. Oliveira, P. Rech, L. L. Pilla, P. O. A. Navaux, and L. Carro. 2014. GPGPUs ECC efficiency and efficacy. In Proceedings of the IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT’14). 209--215.

[133]

M. A. Oxley, E. Jonardi, S. Pasricha, A. A. Maciejewski, H. J. Siegel, P. J. Burns, and G. A. Koenig. 2018. Rate-based thermal, power, and co-location aware resource management for heterogeneous data centers. J. Parallel Distrib. Comput. 112 (2018), 126--139. Parallel Optimization using/for Multi and Many-core High Performance Computing.

[134]

K. O’Brien, I. Pietri, R. Reddy, A. Lastovetsky, and R. Sakellariou. 2017. A survey of power and energy predictive models in HPC systems and applications. ACM Comput. Surv. 50, 3, Article 37 (June 2017), 38 pages.

[135]

S. Park and M. Humphrey. 2011. Predictable high-performance computing using feedback control and admission control. IEEE Trans. Parallel Distrib. Syst. 22, 3 (Mar. 2011), 396--411.

[136]

A. J. Peña, W. Bland, and P. Balaji. 2015. VOCL-FT: Introducing techniques for efficient soft error coprocessor recovery. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’15). 1--12.

[137]

J. D. Pfefferman and B. Cernuschi-Frias. 2002. A nonparametric nonstationary procedure for failure prediction. IEEE Trans. Reliabil. 51, 4 (Dec. 2002), 434--442.

[138]

M. Pizza, L. Strigini, A. Bondavalli, and F. Di Giandomenico. 1998. Optimal discrimination between transient and permanent faults. In Proceedings of the 3rd IEEE International High-Assurance Systems Engineering Symposium. 214--223.

[139]

B. Pourghassemi and A. Chandramowlishwaran. 2017. cudaCR: An in-kernel application-level checkpoint/restart scheme for CUDA-enabled GPUs. In Proceedings of the IEEE Cluster Conference (CLUSTER’17).

[140]

R. Rajachandrasekar, X. Besseron, and D. K. Panda. 2012. Monitoring and predicting hardware failures in HPC clusters with FTB-IPMI. In Proceedings of the IEEE 26th International Parallel and Distributed Processing Symposium. 1136--1143.

[141]

K. K. Rangan, G. Y. Wei, and D. Brooks. 2009. Thread motion: Fine-grained power management for multi-core systems. SIGARCH Comput. Architect. News 37, 3 (June 2009), 302--313.

[142]

Paolo Rech. [n.d.]. Reliability Issues in Current and Future Supercomputers. Retrieved from http://energysfe.ufsc.br/slides/Paolo-Rech-260917.pdf.

[143]

F. Reghenzani, G. Massari, and W. Fornaciari. 2019. The real-time Linux kernel: A survey on PREEMPT_RT. ACM Comput. Surv. 52, 1, Article 18 (Feb. 2019), 36 pages.

[144]

F. Reghenzani, G. Pozzi, G. Massari, S. Libutti, and W. Fornaciari. 2016. The MIG framework: Enabling transparent process migration in Open MPI. In Proceedings of the 23rd European MPI Users’ Group Meeting (EuroMPI’16). ACM, New York, NY, 64--73.

[145]

F. Salfner, M. Lenk, and M. Malek. 2010. A survey of online failure prediction methods. ACM Comput. Surv. 42, 3, Article 10 (March 2010), 42 pages.

[146]

F. Salfner, M. Schieschke, and M. Malek. 2006. Predicting failures of computer systems: A case study for a telecommunication system. In Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium.

[147]

A. Sari, M. Psarakis, and D. Gizopoulos. 2013. Combining checkpointing and scrubbing in FPGA-based real-time systems. In Proceedings of the IEEE 31st VLSI Test Symposium (VTS’13). 1--6.

[148]

M. Shereshevsky, J. Crowell, B. Cukic, V. Gandikota, and Y. Liu. 2003. Software aging and multifractality of memory resources. In Proceedings of the International Conference on Dependable Systems and Networks. 721--730.

[149]

L. Shi, H. Chen, J. Sun, and K. Li. 2012. vCUDA: GPU-accelerated high-performance computing in virtual machines. IEEE Trans. Comput. 61, 6 (June 2012), 804--816.

[150]

D. P. Siewiorek and R. S. Swarz. 1998. Reliable Computer Systems, 3rd ed. A. K. Peters, Ltd.

[151]

R. M. Singer, K. Gross, J. Herzog, R. W. King, and S. Wegerich. 1997. Model-based nuclear power plant monitoring and fault detection: Theoretical foundations. In Proceedings of the Conference on Intelligent System Application to Power Systems (ISAP’97).

[152]

S. Singh and I. Chana. 2016. A survey on resource scheduling in cloud computing: Issues and challenges. J. Grid Comput. 14, 2 (June 2016), 217--264.

[153]

T. J. Slegel, R. M. Averill, M. A. Check, B. C. Giamei, B. W. Krumm, C. A. Krygowski, W. H. Li, J. S. Liptay, J. D. MacDougall, T. J. McPherson, J. A. Navarro, E. M. Schwarz, K. Shum, and C. F. Webb. 1999. IBM’s S/390 G5 microprocessor design. IEEE Micro 19, 2 (Mar. 1999), 12--23.

[154]

P. Smith and N. C. Hutchinson. 1998. Heterogeneous process migration: The Tui system. Softw.-Pract. Exper. 28, 6 (1998), 611--640.

[155]

A. Sridhar, M. M. Sabry, and D. Atienza. 2014. A semi-analytical thermal modeling framework for liquid-cooled ICs. IEEE Trans. Comput.-Aid. Design Integr. Circ. Syst. 33, 8 (Aug. 2014), 1145--1158.

[156]

V. Sridharan, N. DeBardeleben, S. Blanchard, K. B. Ferreira, J. Stearley, J. Shalf, and S. Gurumurthi. 2015. Memory errors in modern systems. ACM SIGARCH Comput. Architect. News 43, 1 (2015), 297--310.

[157]

J. H. Stathis. 2018. The physics of NBTI: What do we really know? In Proceedings of the IEEE International Reliability Physics Symposium (IRPS’18). 2A.1--1--2A.1--4.

[158]

G. Stellner. 1996. CoCheck: Checkpointing and process migration for MPI. In Proceedings of International Conference on Parallel Processing. 526--531.

[159]

G. Stellner. 1996. CoCheck: Checkpointing and process migration for MPI. In Proceedings of the 10th International Parallel Processing Symposium (IPPS’96). IEEE, 526--531.

[160]

J. E. Stone, D. Gohara, and G. Shi. 2010. OpenCL: A parallel programming standard for heterogeneous computing systems. Comput. Sci. Eng. 12, 3 (May 2010), 66--73.

[161]

O. Subasi, S. Di, L. Bautista-Gomez, P. Balaprakash, O. Unsal, J. Labarta, A. Cristal, and F. Cappello. 2016. Spatial support vector regression to detect silent errors in the exascale era. In Proceedings of the 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid’16). 413--424.

[162]

O. Subasi, S. Di, L. Bautista-Gomez, P. Balaprakash, O. Unsal, J. Labarta, A. Cristal, S. Krishnamoorthy, and F. Cappello. 2018. Exploring the capabilities of support vector machines in detecting silent data corruptions. Sustain. Comput.: Inform. Syst. 19 (2018), 277--290.

[163]

K. Sutaria, A. Ramkumar, R. Zhu, R. Rajveev, Y. Ma, and Y. Cao. 2014. BTI-induced aging under random stress waveforms: Modeling, simulation and silicon validation. In Proceedings of the 51st Annual Design Automation Conference (DAC’14). ACM, New York, NY, Article 203, 6 pages.

[164]

T. Suzuki, A. Nukada, and S. Matsuoka. 2016. Transparent checkpoint and restart technology for CUDA applications. In Proceedings of the GPU Technology Conference (GTC’16). https://tinyurl.com/ycb7y8xw.

[165]

H. Takizawa, K. Koyama, K. Sato, K. Komatsu, and H. Kobayashi. 2011. CheCL: Transparent checkpointing and process migration of OpenCL applications. In Proceedings of the IEEE International Parallel & Distributed Processing Symposium. 864--876.

[166]

H. Takizawa, K. Sato, K. Komatsu, and H. Kobayashi. 2009. CheCUDA: A checkpoint/restart tool for CUDA applications. In Proceedings of the International Conference on Parallel and Distributed Computing, Applications and Technologies. 408--413.

[167]

D. Tang and R. K. Iyer. 1993. Dependability measurement and modeling of a multicomputer system. IEEE Trans. Comput. 42, 1 (Jan. 1993), 62--75.

[168]

D. Tiwari, S. Gupta, J. Rogers, D. Maxwell, P. Rech, S. Vazhkudai, D. Oliveira, D. Londo, N. DeBardeleben, P. Navaux, L. Carro, and A. Bland. 2015. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. In Proceedings of the IEEE 21st International Symposium on High Performance Computer Architecture (HPCA’15). 331--342.

[169]

D. Tiwari, S. Gupta, J. Rogers, D. Maxwell, P. Rech, S. Vazhkudai, D. Oliveira, D. Londo, N. DeBardeleben, P. Navaux, L. Carro, and A. Bland. 2015. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. In Proceedings of the IEEE 21st International Symposium on High Performance Computer Architecture (HPCA’15). 331--342.

[170]

T. Troudet and W. Merrill. 1990. A real time neural net estimator of fatigue life. In Proceedings of the International Joint Conference on Neural Networks (IJCNN’90). 59--64 vol. 2.

[171]

D. Turnbull and N. Alldrin. 2003. Failure Prediction in Hardware Systems. Tech. rep. University of California, San Diego, CA. Retrieved from http://www.cs.ucsd.edu/~dturnbul/Papers/ServerPrediction.pdf.

[172]

R. Vilalta, C. V. Apte, J. L. Hellerstein, S. Ma, and S. M. Weiss. 2002. Predictive algorithms in the management of computer systems. IBM Syst. J. 41, 3 (2002), 461--474.

[173]

R. Vilalta and S. Ma. 2002. Predicting rare events in temporal domains. In Proceedings of the IEEE International Conference on Data Mining. 474--481.

[174]

S. Vinoski. 2007. Reliability with Erlang. IEEE Internet Comput. 11, 6 (2007), 79--81.

[175]

C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. 2008. Proactive process-level live migration in HPC environments. In Proceedings of the ACM/IEEE Conference on Supercomputing. IEEE Press, 43.

[176]

J. P. Wang and S. M. Shatz. 1988. Reliability-oriented task allocation in redundant distributed systems. In Proceedings of the 12th Annual International Computer Software Applications Conference (COMPSAC’88). 276--283.

[177]

A. Ward, P. Glynn, and K. Richardson. 1998. Internet service performance failure detection. SIGMETRICS Perform. Eval. Rev. 26, 3 (Dec. 1998), 38--43.

[178]

G. M. Weiss. 1999. Timeweaver: A genetic algorithm for identifying predictive patterns in sequences of events. In Proceedings of the Genetic and Evolutionary Computation Conference. Morgan Kaufmann, 718--725.

[179]

R. Wilhelm, J. Engblom, A. Ermedahl, N. Holsti, S. Thesing, D. Whalley, G. Bernat, C. Ferdinand, R. Heckmann, T. Mitra, F. Mueller, I. Puaut, P. Puschner, J. Staschulat, and P. Stenström. 2008. The worst-case execution-time problem—Overview of methods and survey of tools. ACM Trans. Embed. Comput. Syst. 7, 3, Article 36 (May 2008), 53 pages.

[180]

Y. Xiang, T. Chantem, R. P. Dick, X. S. Hu, and L. Shang. 2010. System-level reliability modeling for MPSoCs. In Proceedings of the 8th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS’10). ACM, New York, NY, 297--306.

[181]

X. Xu, Y. Lin, T. Tang, and Y. Lin. 2010. HiAL-Ckpt: A hierarchical application-level checkpointing for CPU-GPU hybrid systems. In Proceedings of the 5th International Conference on Computer Science Education. 1895--1899.

[182]

J. Yang, X. Zhou, M. Chrobak, Y. Zhang, and L. Jin. 2008. Dynamic thermal management through task scheduling. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’08). 191--201.

[183]

E. Yilmaz and L. Gilly. 2012. Redundancy and Reliability for an HPC Data Centre. Retrieved from http://www.prace-ri.eu/IMG/pdf/HPC-Centre-Redundancy-Reliability-WhitePaper.pdf.

[184]

A. B. Yoo, M. A. Jette, and M. Grondona. 2003. SLURM: Simple Linux utility for resource management. In Job Scheduling Strategies for Parallel Processing, Dror Feitelson, Larry Rudolph, and Uwe Schwiegelshohn (Eds.). Springer Berlin, 44--60.

[185]

Z. Yuan, Y. Liu, J. Li, J. Hu, C. J. Xue, and H. Yang. 2017. CP-FPGA: Energy-efficient nonvolatile FPGA with offline/online checkpointing optimization. IEEE Trans. Very Large Scale Integr. Syst. 25, 7 (July 2017), 2153--2163.

Index Terms

Predictive Reliability and Fault Management in Exascale Systems: State of the Art and Perspectives
1. Computer systems organization
  1. Architectures
    1. Distributed architectures
      1. Grid computing
  2. Dependable and fault-tolerant systems and networks
    1. Reliability
2. General and reference
  1. Document types
    1. Surveys and overviews

Published In

ACM Computing Surveys Volume 53, Issue 5

September 2021

782 pages

ISSN:0360-0300

EISSN:1557-7341

DOI:10.1145/3426973

Editor:
Albert Zomaya
University of Sydney, Australia

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 September 2020

Accepted: 01 May 2020

Revised: 01 April 2020

Received: 01 October 2019

Published in CSUR Volume 53, Issue 5

Qualifiers

Survey
Research
Refereed

Funding Sources

Ramon y Cajal Postdoctoral Fellowship
Generalitat de Catalunya
European Union's Horizon 2020 (H2020) research and innovation program under the FET-HPC
Ministry of Economy and Competitiveness of Spain
HiPEAC Network of Excellence
