DOI: 10.1145/2063348.2063355

System-level monitoring of floating-point performance to improve effective system utilization

Published: 12 November 2011

Abstract

NCAR's Bluefire supercomputer is instrumented with a set of low-overhead processes that continually monitor the floating-point counters of its 3,840 batch-compute cores. We extract performance numbers for each batch job by correlating the counter data from the nodes on which the job ran. Guided by experience and heuristics for good performance, we use this data, in part, to identify poorly performing jobs and then work with users to improve their jobs' efficiency. Often, the solution involves simple steps such as spawning an adequate number of processes or threads, binding the processes or threads to cores, using large memory pages, or using adequate compiler optimization. These efforts typically yield performance improvements and wall-clock runtime reductions of 10% to 20%. With more involved changes to codes and scripts, some users have obtained performance improvements of 40% to 90%. We discuss our instrumentation, some successful cases, and the general applicability of the approach to other systems.
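
The per-job correlation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Sample` and `Job` records, the node names, the per-node peak rate, and the 2% efficiency threshold are all hypothetical values chosen for illustration. The sketch sums the floating-point counts sampled on a job's nodes during its run window, converts the total to an average rate, and flags jobs whose fraction of a nominal peak falls below the threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One counter reading from one node (hypothetical record layout)."""
    node: str     # node name, e.g. "n1" (illustrative)
    start: float  # interval start, in seconds
    end: float    # interval end, in seconds
    flops: float  # floating-point operations counted in the interval


@dataclass(frozen=True)
class Job:
    """One batch job as reported by the scheduler (hypothetical layout)."""
    job_id: str
    nodes: frozenset  # names of the nodes the job ran on
    start: float
    end: float


def job_flop_rate(job, samples):
    """Average FLOP/s over the job's nodes and run window."""
    total = sum(
        s.flops
        for s in samples
        if s.node in job.nodes and s.start >= job.start and s.end <= job.end
    )
    elapsed = job.end - job.start
    return total / elapsed if elapsed > 0 else 0.0


def flag_low_efficiency(jobs, samples, peak_per_node, threshold=0.02):
    """Return (job_id, efficiency) for jobs below the assumed threshold."""
    flagged = []
    for job in jobs:
        peak = peak_per_node * len(job.nodes)
        efficiency = job_flop_rate(job, samples) / peak
        if efficiency < threshold:
            flagged.append((job.job_id, efficiency))
    return flagged


# Illustrative data only: two jobs and an assumed peak of 1000 FLOP/s per node.
samples = [
    Sample("n1", 0.0, 50.0, 2000.0),
    Sample("n1", 50.0, 100.0, 2000.0),
    Sample("n2", 0.0, 100.0, 4000.0),
    Sample("n3", 0.0, 100.0, 500.0),
]
jobs = [
    Job("A", frozenset({"n1", "n2"}), 0.0, 100.0),
    Job("B", frozenset({"n3"}), 0.0, 100.0),
]
flagged = flag_low_efficiency(jobs, samples, peak_per_node=1000.0)
```

In this toy data, job A reaches 4% of the assumed peak and job B 0.5%, so only B is flagged. A production monitor would also have to handle samples that straddle job boundaries and nodes shared between jobs, which this sketch ignores.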

Published In

SC '11: State of the Practice Reports
November 2011
242 pages
ISBN:9781450311397
DOI:10.1145/2063348
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. operational or end-user support
  2. performance

Qualifiers

  • Research-article

Conference

SC '11
Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Cited By

  • (2018) A Job Sizing Strategy for High-Throughput Scientific Workflows. IEEE Transactions on Parallel and Distributed Systems 29:2 (240-253). DOI: 10.1109/TPDS.2017.2762310. Online publication date: 1-Feb-2018
  • (2016) Performance Analysis Tool for HPC and Big Data Applications on Scientific Clusters. Conquering Big Data with High Performance Computing (139-161). DOI: 10.1007/978-3-319-33742-5_7. Online publication date: 17-Sep-2016
  • (2015) Practical Resource Monitoring for Robust High Throughput Computing. Proceedings of the 2015 IEEE International Conference on Cluster Computing (650-657). DOI: 10.1109/CLUSTER.2015.115. Online publication date: 8-Sep-2015
  • (2014) Comprehensive resource use monitoring for HPC systems with TACC stats. Proceedings of the First International Workshop on HPC User Support Tools (13-21). DOI: 10.1109/HUST.2014.7. Online publication date: 16-Nov-2014
  • (2014) Comprehensive, open-source resource usage measurement and analysis for HPC systems. Concurrency and Computation: Practice and Experience 26:13 (2191-2209). DOI: 10.1002/cpe.3245. Online publication date: 6-Mar-2014
  • (2013) Enabling comprehensive data-driven system management for large computational facilities. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (1-11). DOI: 10.1145/2503210.2503230. Online publication date: 17-Nov-2013
  • (2013) Comprehensive job level resource usage measurement and analysis for XSEDE HPC systems. Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery (1-8). DOI: 10.1145/2484762.2484781. Online publication date: 22-Jul-2013
  • (2012) Performance optimization on a supercomputer with cTuning and the PGI compiler. Proceedings of the 2nd International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era (12-20). DOI: 10.1145/2185475.2185477. Online publication date: 3-Mar-2012
