DOI: 10.1145/3502181.3531465
research-article

Understanding Memory Failures on a Petascale Arm System

Published: 27 June 2022

Abstract

Novel HPC platforms present interesting challenges and opportunities. Analyzing these systems can improve our understanding of both the specific platform being studied and large-scale systems in general. Arm is one such architecture: it has been explored in HPC for several years, yet little is known about its viability for supporting large-scale production workloads in terms of system reliability. The Astra system at Sandia National Laboratories was the first public petaFLOPS Arm-based system on the Top500 and has been successfully running production HPC applications for a couple of years. In this paper, we analyze memory failure data collected from Astra while the system was in production running unclassified applications. This analysis yields several contributions relevant to both the Arm platform and HPC systems in general. First, we outline the number of components replaced due to reliability issues in standing up this first-of-its-kind, large-scale HPC system. Second, we show how the distributions of correctable DRAM faults and errors on Astra differ, demonstrating that not properly accounting for faults can lead to erroneous conclusions. Additionally, we characterize DRAM faults on the system and show, contrary to existing work, that memory faults are uniformly distributed across CPU socket, DRAM column, bank, and rack region, but are not uniform across node, DIMM rank, DIMM slot on the motherboard, and system rack: some racks, ranks, and DIMM slots experience more faults than others. We also show the impact of temperature and power on DRAM correctable errors. Finally, we compare the results presented here in detail with the positional effects found in several previous large-scale reliability studies. The results of this analysis provide valuable guidance to organizations standing up first-in-class platforms in HPC, organizations using Arm in HPC, and the large-scale HPC community in general.
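
To make the distinction between faults and errors concrete, the following is a minimal Python sketch and not the paper's method. It assumes, as a simplification, that every correctable error logged on the same DIMM stems from a single underlying fault. The log records, rack names, and DIMM identifiers are invented for illustration and are not Astra data.

```python
from collections import Counter

# Hypothetical correctable-error log: one (rack, dimm_id) tuple per logged error.
# These values are made up for illustration; they are not Astra data.
ERROR_LOG = [
    ("rack1", "dimm-a"), ("rack1", "dimm-a"), ("rack1", "dimm-a"),
    ("rack1", "dimm-a"), ("rack1", "dimm-a"), ("rack1", "dimm-a"),
    ("rack2", "dimm-b"),
    ("rack2", "dimm-c"),
    ("rack3", "dimm-d"),
]


def errors_per_rack(log):
    """Count every logged error, grouped by rack."""
    return Counter(rack for rack, _dimm in log)


def faults_per_rack(log):
    """Count distinct faulty DIMMs per rack.

    Simplifying assumption: all errors on one DIMM come from one fault,
    so each distinct (rack, dimm_id) pair is counted once.
    """
    distinct_faults = set(log)
    return Counter(rack for rack, _dimm in distinct_faults)


if __name__ == "__main__":
    print("errors per rack:", dict(errors_per_rack(ERROR_LOG)))
    print("faults per rack:", dict(faults_per_rack(ERROR_LOG)))
```

With this invented log, rack1 accounts for six of the nine errors but only one of the four faults, so ranking racks by error count and ranking them by fault count point at different racks.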

Published In

HPDC '22: Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing
June 2022
314 pages
ISBN: 9781450391993
DOI: 10.1145/3502181

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. arm
  2. dram reliability
  3. hardware infant mortality
  4. memory failures
  5. temperature correlation

Qualifiers

  • Research-article

Conference

HPDC '22

Acceptance Rates

Overall Acceptance Rate 166 of 966 submissions, 17%
