DOI: 10.1145/2949550.2949643
research-article
Public Access

Practical Monitoring of Resource Utilization for HPC Applications

Published: 17 July 2016

Abstract

HPC centers run a diverse set of applications from a variety of scientific domains. Every application has different resource requirements, but it is difficult for domain experts to find out what these requirements are and how they impact performance. In particular, the utilization of shared resources such as parallel file systems may influence application performance in significant ways that are not always obvious to the user. We present a tool designed to provide the information that is most critical for running an application efficiently on HPC systems. The information provided forms a complete view of the application's interaction with the system resources, which is typically missing from other profiling and analysis tools. The tool is designed to be scalable and have minimal impact on application performance, and includes support for different accelerators.

Cited By

  • (2024) Advancements in High Performance Computing Cluster Resource Utilization through a Comprehensive Monitoring Dashboard. 2024 11th International Conference on Computing for Sustainable Global Development (INDIACom), pp. 158-165. DOI: 10.23919/INDIACom61295.2024.10498826. Online publication date: 28-Feb-2024
  • (2023) REMORA Resource Monitor: Usability, Performance and User Interface Improvements. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 663-672. DOI: 10.1145/3624062.3624141. Online publication date: 12-Nov-2023
  • (2018) High Performance Cluster Monitoring System. 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1188-1193. DOI: 10.23919/APSIPA.2018.8659536. Online publication date: Nov-2018

Published In

XSEDE16: Proceedings of the XSEDE16 Conference on Diversity, Big Data, and Science at Scale
July 2016
405 pages
ISBN:9781450347556
DOI:10.1145/2949550
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. HPC
  2. monitoring
  3. resources

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

XSEDE16

Acceptance Rates

Overall Acceptance Rate 129 of 190 submissions, 68%

Article Metrics

  • Downloads (Last 12 months): 163
  • Downloads (Last 6 weeks): 23
Reflects downloads up to 12 Jan 2025
