DCDB Wintermute: Enabling Online and Holistic Operational Data Analytics on HPC Systems

Authors:

Martin Schulz

HPDC '20: Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing

Pages 101 - 112

https://doi.org/10.1145/3369583.3392674

Published: 23 June 2020

Abstract

As we approach the exascale era, the size and complexity of HPC systems continue to increase, raising concerns about their manageability and sustainability. For this reason, more and more HPC centers are experimenting with fine-grained monitoring coupled with Operational Data Analytics (ODA) to optimize the efficiency and effectiveness of system operations. However, while monitoring is a common reality in HPC, there is no well-stated and comprehensive list of requirements, nor are there matching frameworks, to support holistic and online ODA. This leads to insular ad-hoc solutions, each addressing only specific aspects of the problem.

In this paper, we propose Wintermute, a novel generic framework to enable online ODA on large-scale HPC installations. Its design is based on the results of a literature survey of common operational requirements. We implement Wintermute on top of the holistic DCDB monitoring system, offering a large variety of configuration options to accommodate the varying requirements of ODA applications. Moreover, Wintermute is based on a set of logical abstractions that ease the configuration of models at large scale and maximize code re-use. We highlight Wintermute's flexibility through a series of practical case studies, each targeting a different aspect of the management of HPC systems, and then demonstrate the small resource footprint of our implementation.
