DOI: 10.1145/3392717.3392774

Characterization and identification of HPC applications at leadership computing facility

Published: 29 June 2020

Abstract

High Performance Computing (HPC) is an important method for scientific discovery via large-scale simulation, data analysis, or artificial intelligence. Leadership-class supercomputers are expensive but essential for running large HPC applications. The Petascale era of supercomputers began in 2008, with the first machines achieving performance in excess of one petaflops, and with the advent of new supercomputers in 2021 (e.g., Aurora, Frontier), the Exascale era will soon begin. However, the high theoretical computing capability (i.e., peak FLOPS) of a machine is not the only meaningful target when designing a supercomputer, as the resource demands of applications vary. A deep understanding of the characteristics of the applications that run on a leadership supercomputer is one of the most important inputs to planning its design, development, and operation.
To improve our understanding of HPC applications, user demands, and resource usage characteristics, we perform correlative analysis of various logs from different subsystems of a leadership supercomputer. This analysis reveals surprising, sometimes counter-intuitive patterns that, in some cases, conflict with existing assumptions and have important implications for future system designs as well as supercomputer operations. For example, our analysis shows that while applications spend significant time in MPI, most applications spend very little time on file I/O. Combined analysis of hardware event logs and task failure logs shows that the probability of a hardware FATAL event causing a task failure is low. Combined analysis of control system logs and file I/O logs reveals that pure POSIX I/O is used more widely than higher-level parallel I/O.
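To make the correlative analysis concrete, the sketch below joins two per-job log exports on a shared job identifier and derives the kind of statistics discussed above (fraction of runtime spent in file I/O, share of jobs using pure POSIX I/O). It is a minimal illustration, not the authors' pipeline: the file names and column names (cobalt_jobs.csv, darshan_io.csv, job_id, runtime_sec, posix_io_sec, mpiio_io_sec) are hypothetical placeholders for whatever the scheduler and I/O-characterization (e.g., Darshan) logs actually export.

```python
# Minimal sketch (assumed inputs): correlate two per-job log sources by job id with pandas.
import pandas as pd

# Hypothetical exports: one row per job in each file, sharing a "job_id" column.
jobs = pd.read_csv("cobalt_jobs.csv")   # job_id, user, nodes, runtime_sec, exit_status
io = pd.read_csv("darshan_io.csv")      # job_id, posix_io_sec, mpiio_io_sec, bytes_read, bytes_written

# Left join: keep every scheduled job, even those with no I/O record.
merged = jobs.merge(io, on="job_id", how="left")

# Fraction of wall time spent in file I/O; jobs without an I/O record count as zero I/O.
merged["io_fraction"] = (
    merged[["posix_io_sec", "mpiio_io_sec"]].sum(axis=1) / merged["runtime_sec"]
)

# Share of jobs that performed POSIX I/O but recorded no MPI-IO activity.
posix_only = (merged["mpiio_io_sec"].fillna(0.0) == 0) & (merged["posix_io_sec"].fillna(0.0) > 0)
print("jobs using pure POSIX I/O:", posix_only.mean())
print("median fraction of runtime in file I/O:", merged["io_fraction"].median())
```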
Based on the holistic insights into applications gained through co-analysis of multiple logs from different perspectives, together with general intuition, we engineer features to "fingerprint" HPC applications. We use t-SNE (a machine learning technique for dimensionality reduction) to validate the explainability of our features, and finally train machine learning models to identify HPC applications or to group those with similar characteristics. To the best of our knowledge, this is the first work that combines logs on file I/O, computing, and inter-node communication for insightful analysis of HPC applications in production.
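As an illustration of the fingerprinting workflow described above, the sketch below embeds engineered per-job features with t-SNE to check visually whether jobs from the same application cluster together, and then trains a classifier to identify the application. This is a hedged example, not the paper's implementation: features.csv, its columns, and the choice of an XGBoost classifier are assumptions made for illustration; the abstract only states that machine learning models are trained on the engineered features.

```python
# Minimal sketch (assumed inputs): validate engineered features with t-SNE,
# then train a gradient-boosted-tree classifier to identify the application.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Hypothetical table: one row per job, engineered feature columns plus an "app" label.
df = pd.read_csv("features.csv")
X = df.drop(columns=["app"]).values
y = df["app"].astype("category").cat.codes   # integer label per application

# 2-D embedding to eyeball whether jobs of the same application form clusters.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
plt.scatter(emb[:, 0], emb[:, 1], c=y, s=5, cmap="tab20")
plt.savefig("tsne_fingerprints.png")

# Train/test split and a boosted-tree model as the application identifier.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
clf.fit(X_tr, y_tr)
print("identification accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```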



Information

Published In

ICS '20: Proceedings of the 34th ACM International Conference on Supercomputing
June 2020
499 pages
ISBN:9781450379830
DOI:10.1145/3392717
  • General Chairs: Eduard Ayguadé, Wen-mei Hwu
  • Program Chairs: Rosa M. Badia, H. Peter Hofstee
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 June 2020


Author Tags

  1. application identification
  2. characterization
  3. high performance computing
  4. logs data mining

Qualifiers

  • Research-article

Conference

ICS '20: 2020 International Conference on Supercomputing
June 29 - July 2, 2020
Barcelona, Spain

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%


Article Metrics

  • Downloads (Last 12 months): 72
  • Downloads (Last 6 weeks): 17

Reflects downloads up to 30 Aug 2024


Cited By

  • (2024) Mobilizing underutilized storage nodes via job path: A job-aware file striping approach. Parallel Computing. https://doi.org/10.1016/j.parco.2024.103095 (Aug 2024)
  • (2023) FreeTrain: A Framework to Utilize Unused Supercomputer Nodes for Training Neural Networks. 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid), 299-310. https://doi.org/10.1109/CCGrid57682.2023.00036 (May 2023)
  • (2023) An empirical study of major page faults for failure diagnosis in cluster systems. The Journal of Supercomputing 79(16), 18445-18479. https://doi.org/10.1007/s11227-023-05366-1 (May 2023)
  • (2023) I/O-signature-based feature analysis and classification of high-performance computing applications. Cluster Computing 27(3), 3219-3231. https://doi.org/10.1007/s10586-023-04139-y (Sep 2023)
  • (2022) Machine Learning Assisted HPC Workload Trace Generation for Leadership Scale Storage Systems. Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing, 199-212. https://doi.org/10.1145/3502181.3531457 (Jun 2022)
  • (2022) Design and Performance Characterization of RADICAL-Pilot on Leadership-Class Platforms. IEEE Transactions on Parallel and Distributed Systems 33(4), 818-829. https://doi.org/10.1109/TPDS.2021.3105994 (Apr 2022)
  • (2022) LuxIO: Intelligent Resource Provisioning and Auto-Configuration for Storage Services. 2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC), 246-255. https://doi.org/10.1109/HiPC56025.2022.00041 (Dec 2022)
  • (2022) I/O separation scheme on Lustre metadata server based on multi-stream SSD. Cluster Computing 26(5), 2883-2896. https://doi.org/10.1007/s10586-022-03801-1 (Nov 2022)
  • (2022) Unveiling User Behavior on Summit Login Nodes as a User. Computational Science – ICCS 2022, 516-529. https://doi.org/10.1007/978-3-031-08751-6_37 (Jun 2022)
  • (2021) Design and Evaluation of a Simple Data Interface for Efficient Data Transfer across Diverse Storage. ACM Transactions on Modeling and Performance Evaluation of Computing Systems 6(1), 1-25. https://doi.org/10.1145/3452007 (May 2021)
