DOI: 10.1145/2903150.2903481
poster

Resolving frontier problems of mastering large-scale supercomputer complexes

Published: 16 May 2016

Abstract

Managing and administering large-scale HPC centers is a complicated problem. Using a number of independent tools to resolve its seemingly independent subproblems can become a bottleneck as the scale of systems, the number of hardware and software components, the variety of user applications and license types, and the number of users and workgroups grow rapidly. The developed tool is designed to help resolve routine problems in managing and administering any supercomputer center, from a stand-alone system up to top-rank supercomputer centers that include a number of entirely different HPC systems. The toolkit implements a flexibly configurable set of essential tools in a single interface. It also automates typical multi-step administration and management procedures. Another important design and implementation feature is that the toolkit can be installed and used without any significant changes to existing administration tools and system software. The developed tool is not integrated with the system software of the target machines: it runs on a remote server and executes scripts on the HPC systems via SSH as a dedicated user whose limited access permissions allow only certain actions. This greatly reduces the possibility of security issues and addresses many fault tolerance issues, which are among the key challenges on the road to Exascale. At the same time, administrators remain free to perform any operation with the tools best suited to the situation, whether ours or any other available tool. Trial deployment of the developed system proved its practicality in an HPC center with several Petaflop-level supercomputers and thousands of active researchers from a variety of institutions working on several hundred applied projects.
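
The remote-execution design described in the abstract can be illustrated with a minimal Python sketch. It is not the toolkit's actual code: the action names, script paths, the `hpcadmin` account, and the `Cluster` type are all hypothetical, and the only assumption taken from the abstract is that a management server runs a restricted set of scripts on each HPC system over SSH as a dedicated, limited-permission user.

```python
import shlex
import subprocess
from dataclasses import dataclass

# Hypothetical whitelist: action name -> script installed on the HPC system.
# These names and paths are illustrative only, not taken from the toolkit.
ALLOWED_ACTIONS = {
    "add_user": "/opt/hpc-admin/bin/add_user.sh",
    "block_user": "/opt/hpc-admin/bin/block_user.sh",
    "queue_status": "/opt/hpc-admin/bin/queue_status.sh",
}


@dataclass(frozen=True)
class Cluster:
    host: str
    admin_user: str = "hpcadmin"  # assumed dedicated, limited-permission account


def build_ssh_command(cluster, action, args=()):
    """Return the argv for running one whitelisted script on a cluster."""
    if action not in ALLOWED_ACTIONS:
        raise PermissionError(f"action not permitted: {action}")
    # Quote every part so arguments cannot inject extra remote shell commands.
    remote = " ".join(shlex.quote(p) for p in (ALLOWED_ACTIONS[action], *args))
    return [
        "ssh",
        "-o", "BatchMode=yes",      # never prompt for a password
        "-o", "ConnectTimeout=10",  # give up quickly on unreachable hosts
        f"{cluster.admin_user}@{cluster.host}",
        remote,
    ]


def run_action(cluster, action, args=(), runner=subprocess.run):
    """Run a whitelisted action; return (ok, output) on connection failures too."""
    cmd = build_ssh_command(cluster, action, args)
    try:
        result = runner(cmd, capture_output=True, text=True, timeout=60)
    except (subprocess.TimeoutExpired, OSError) as exc:
        # An unreachable system is reported to the caller, not raised.
        return False, str(exc)
    ok = result.returncode == 0
    return ok, result.stdout if ok else result.stderr
```

Under these assumptions, `run_action(Cluster("hpc1.example.org"), "queue_status")` would run a single permitted script on one system. Because nothing is installed into the system software of the target machine, a failure of the management server or of one SSH connection leaves the HPC systems and the administrators' other tools unaffected.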

    Published In

    CF '16: Proceedings of the ACM International Conference on Computing Frontiers
    May 2016
    487 pages
    ISBN:9781450341288
    DOI:10.1145/2903150
    General Chairs: Gianluca Palermo, John Feo
    Program Chairs: Antonino Tumeo, Hubertus Franke
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. automation of administering routines
    2. fault-tolerant administering
    3. large-scale system administering
    4. managing HPC systems
    5. user support

    Qualifiers

    • Poster

    Conference

    CF'16: Computing Frontiers Conference
    May 16 - 19, 2016
    Como, Italy

    Acceptance Rates

    CF '16 Paper Acceptance Rate 30 of 94 submissions, 32%;
    Overall Acceptance Rate 273 of 785 submissions, 35%

    Cited By

    • (2019) Evolution of the Octoshell HPC Center Management System. Parallel Computational Technologies, pp. 19-33. DOI: 10.1007/978-3-030-28163-2_2. Online publication date: 2-Aug-2019
    • (2019) HPC Software for Massive Analysis of the Parallel Efficiency of Applications. Parallel Computational Technologies, pp. 3-18. DOI: 10.1007/978-3-030-28163-2_1. Online publication date: 2-Aug-2019
    • (2018) Modeling parallel processing of databases on the central processor Intel Xeon Phi KNL. 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 1605-1610. DOI: 10.23919/MIPRO.2018.8400288. Online publication date: May-2018
    • (2018) Deep Analysis of Job State Statistics on Lomonosov-2 Supercomputer. Supercomputing Frontiers and Innovations: an International Journal, 5:2, pp. 4-10. DOI: 10.14529/jsfi180201. Online publication date: 15-Jun-2018
    • (2018) Improved Process of Running Tasks in the High Performance Computing System. 2018 16th International Conference on Emerging eLearning Technologies and Applications (ICETA), pp. 133-140. DOI: 10.1109/ICETA.2018.8572230. Online publication date: Nov-2018
    • (2018) Role-Dependent Resource Utilization Analysis for Large HPC Centers. Parallel Computational Technologies, pp. 47-61. DOI: 10.1007/978-3-319-99673-8_4. Online publication date: 26-Aug-2018
    • (2017) Model of education and training strategy for the management of HPC systems. 2017 IEEE 14th International Scientific Conference on Informatics, pp. 400-405. DOI: 10.1109/INFORMATICS.2017.8327282. Online publication date: Nov-2017
    • (2017) JobDigest – Detailed System Monitoring-Based Supercomputer Application Behavior Analysis. Supercomputing, pp. 516-529. DOI: 10.1007/978-3-319-71255-0_42. Online publication date: 15-Nov-2017
    • (2017) The Top50 List Vivification in the Evolution of HPC Rankings. Parallel Computational Technologies, pp. 14-26. DOI: 10.1007/978-3-319-67035-5_2. Online publication date: 27-Sep-2017
    • (2016) System Monitoring-Based Holistic Resource Utilization Analysis for Every User of a Large HPC Center. Algorithms and Architectures for Parallel Processing, pp. 305-318. DOI: 10.1007/978-3-319-49956-7_24. Online publication date: 19-Nov-2016
