DOI: 10.1145/3369583.3392674
research-article

DCDB Wintermute: Enabling Online and Holistic Operational Data Analytics on HPC Systems

Published: 23 June 2020

Abstract

As we approach the exascale era, the size and complexity of HPC systems continue to increase, raising concerns about their manageability and sustainability. For this reason, more and more HPC centers are experimenting with fine-grained monitoring coupled with Operational Data Analytics (ODA) to optimize the efficiency and effectiveness of system operations. However, while monitoring is a common reality in HPC, there is no well-stated and comprehensive list of requirements, nor matching frameworks, to support holistic and online ODA. This leads to insular ad-hoc solutions, each addressing only specific aspects of the problem.
In this paper we propose Wintermute, a novel generic framework to enable online ODA on large-scale HPC installations. Its design is based on the results of a literature survey of common operational requirements. We implement Wintermute on top of the holistic DCDB monitoring system, offering a large variety of configuration options to accommodate the varying requirements of ODA applications. Moreover, Wintermute is based on a set of logical abstractions to ease the configuration of models at a large scale and maximize code re-use. We highlight Wintermute's flexibility through a series of practical case studies, each targeting a different aspect of the management of HPC systems, and then demonstrate the small resource footprint of our implementation.

Supplementary Material

MP4 File (3369583.3392674.mp4)
In this talk we present Wintermute, a novel generic framework to enable online ODA on large-scale HPC installations. Its design is based on the results of a literature survey of common operational requirements. We implement Wintermute on top of the holistic DCDB monitoring system, offering a large variety of configuration options to accommodate the varying requirements of ODA applications. Moreover, Wintermute is based on a set of logical abstractions to ease the configuration of models at a large scale and maximize code re-use. We highlight Wintermute's flexibility through a series of practical case studies, each targeting a different aspect of the management of HPC systems, and then demonstrate the small resource footprint of our implementation.

Published In

HPDC '20: Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing
June 2020
246 pages
ISBN:9781450370523
DOI:10.1145/3369583
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. high-performance computing
  2. monitoring
  3. online analysis
  4. operational data analytics
  5. system management

Qualifiers

  • Research-article

Conference

HPDC '20
Acceptance Rates

Overall Acceptance Rate 166 of 966 submissions, 17%

Article Metrics

  • Downloads (last 12 months): 28
  • Downloads (last 6 weeks): 1

Reflects downloads up to 20 Jan 2025

Cited By

  • (2024) zns-tools: An eBPF-powered, Cross-Layer Storage Profiling Tool for NVMe ZNS SSDs. Proceedings of the 4th Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems, 23-32. DOI: 10.1145/3642963.3652205. Online publication date: 22-Apr-2024
  • (2024) From the Physics Lab to the Computer Lab: Towards Flexible and Comprehensive DevOps for Quantum Computing. Proceedings of the 21st ACM International Conference on Computing Frontiers: Workshops and Special Sessions, 139-143. DOI: 10.1145/3637543.3653432. Online publication date: 7-May-2024
  • (2024) Navigating Exascale Operational Data Analytics: From Inundation to Insight. SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1795-1804. DOI: 10.1109/SCW63240.2024.00226. Online publication date: 17-Nov-2024
  • (2024) Evolving Large Scale HPC Monitoring & Analysis to Track Modern Dynamic Environments. 2024 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops), 36-43. DOI: 10.1109/CLUSTERWorkshops61563.2024.00016. Online publication date: 24-Sep-2024
  • (2024) A Multi-Level, Multi-Scale Visual Analytics Approach to Assessment of Multifidelity HPC Systems. 2024 IEEE 24th International Symposium on Cluster, Cloud and Internet Computing (CCGrid), 478-488. DOI: 10.1109/CCGrid59990.2024.00060. Online publication date: 6-May-2024
  • (2024) A review on the decarbonization of high-performance computing centers. Renewable and Sustainable Energy Reviews 189 (114019). DOI: 10.1016/j.rser.2023.114019. Online publication date: Jan-2024
  • (2024) Introducing the Metric Proxy for Holistic I/O Measurements. High Performance Computing. ISC High Performance 2024 International Workshops, 213-226. DOI: 10.1007/978-3-031-73716-9_15. Online publication date: 14-Dec-2024
  • (2024) Debugging Big Data Systems for Big Data Analytics. Big Data Analytics, 171-192. DOI: 10.1007/978-3-031-55639-5_8. Online publication date: 8-May-2024
  • (2023) Quantum Computer Metrics and HPC Center Environmental Sensor Data Analysis Towards Fidelity Prediction. 2023 IEEE International Conference on Quantum Computing and Engineering (QCE), 154-160. DOI: 10.1109/QCE57702.2023.10200. Online publication date: 17-Sep-2023
  • (2023) Towards bespoke optimizations of energy efficiency in HPC environments. Applied AI Letters. DOI: 10.1002/ail2.87. Online publication date: 13-Dec-2023
