DOI: 10.1145/2063348.2063382

SPOTlight on testing: stability, performance and operational testing of LANL HPC clusters

Published: 12 November 2011

Abstract

Testing is sometimes a forgotten component of system management, but it becomes very important in the realm of High Performance Computing (HPC) clusters. Many large-scale HPC cluster installations are one of a kind, with unknown issues and unexpected behaviors. First, the initial installation may uncover complex configuration interactions that are only apparent at scale; Stability becomes a critical feature of early system testing. Second, Performance may be significantly impacted by small changes to the system. Third, after initial shakeout, users expect a system that is reliable on their terms; ongoing Operational tests verify reliability, and provide early warning of developing problems. A robust test suite should address all of these test categories, and present both tests and results in a manner that meets usability requirements. We will describe Los Alamos National Laboratory's current test suite, and the development project to expand the suite to cover these areas and provide better tools for analysis and reporting.

Cited By

  • (2023) Functional Testing with STLs: A Step Towards Reliable RISC-V-based HPC Commodity Clusters. High Performance Computing, pp. 444-457. DOI: 10.1007/978-3-031-40843-4_33. Online publication date: 25-Aug-2023
  • (2017) Testpilot. Proceedings of the Fourth International Workshop on HPC User Support Tools, pp. 1-10. DOI: 10.1145/3152493.3152555. Online publication date: 12-Nov-2017
  • (2014) Model-Driven Resilience Assessment of Modifications to HPC Infrastructures. Euro-Par 2013: Parallel Processing Workshops, pp. 707-716. DOI: 10.1007/978-3-642-54420-0_69. Online publication date: 2014

Published In

SC '11: State of the Practice Reports
November 2011
242 pages
ISBN:9781450311397
DOI:10.1145/2063348

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. RAS
  2. SPOT
  3. accessibility
  4. high performance computing
  5. operational testing
  6. performance testing
  7. reliability
  8. serviceability
  9. stability testing
  10. test driven development
  11. test framework

Qualifiers

  • Research-article

Conference

SC '11

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Article Metrics

  • Downloads (last 12 months): 6
  • Downloads (last 6 weeks): 0
Reflects downloads up to 18 Jan 2025.
