Research article, open access. DOI: 10.1145/2484762.2484763

Using XDMoD to facilitate XSEDE operations, planning and analysis

Published: 22 July 2013

Abstract

The XDMoD auditing tool provides, for the first time, a comprehensive means of measuring both the utilization and the performance of high-end cyberinfrastructure (CI), with an initial focus on XSEDE. Here, we demonstrate, through several case studies, its utility for providing important metrics on the resource utilization and performance of TeraGrid/XSEDE that can be used for detailed analysis and planning as well as for improving operational efficiency and performance.
Measuring the utilization of high-end cyberinfrastructure such as XSEDE provides a detailed understanding of how a given CI resource is being used and can lead to improved performance of the resource in terms of job throughput or any number of desired job characteristics. In the case studies considered here, a detailed historical analysis of XSEDE usage data using XDMoD clearly demonstrates the tremendous growth in the number of users, overall usage, and scale of the simulations routinely carried out. Not surprisingly, physics, chemistry, and the engineering disciplines are shown to be heavy users of the resources. However, as the data clearly show, molecular biosciences are now a significant and growing user of XSEDE resources, accounting for more than 20 percent of all service units (SUs) consumed in 2012. XDMoD also shows that the resources required by the various scientific disciplines are very different. Physics, astronomical sciences, and atmospheric sciences tend to solve large problems requiring many cores. Molecular biosciences applications, on the other hand, require many cycles but do not employ core counts that are as large. Such distinctions are important in guiding future cyberinfrastructure design decisions.
XDMoD's implementation of a novel application kernel-based auditing system to measure overall CI system performance and quality of service is shown, through several examples, to provide a useful means of automatically detecting underperforming hardware and software. This capability is especially critical given the complex composition of today's advanced CI. Examples include an application kernel based on a widely used quantum chemistry program that uncovered a software bug in the I/O stack of a commercial parallel file system; the vendor subsequently fixed the bug with a software patch that is now part of its standard release. This error, which resulted in dramatically increased execution times as well as outright job failure, would likely have gone unnoticed for some time and was uncovered only through the implementation of XDMoD's suite of application kernels.

Published In

XSEDE '13: Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery
July 2013
433 pages
ISBN:9781450321709
DOI:10.1145/2484762

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. CI performance metrics
  2. HPC metrics
  3. XDMoD
  4. XSEDE
  5. application kernels
  6. technology audit service

Conference

XSEDE '13

Acceptance Rates

Overall Acceptance Rate 129 of 190 submissions, 68%
