Towards Anomaly Detection for Monitoring Power Consumption in HPC Facilities

Published: 08 December 2022

    Abstract

    Given the increasing complexity and heterogeneity of today's computing infrastructure, power efficiency and fault tolerance remain the top challenges in operating a High Performance Computing (HPC) facility. Many recent research efforts focus on monitoring solutions that collect, correlate, and analyze health events and metrics data from computing infrastructure to identify not only normal events but also anomalous ones, thereby helping to reduce downtime and power consumption in the face of the critical needs of a computational center and its users. In this preliminary work, we present an anomaly detection methodology integrated with the Operations Monitoring and Notification Infrastructure (OMNI) data warehouse at the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory (LBNL); the methodology implements anomaly detection algorithms for identifying abnormal power patterns. We evaluated our methodology using five million unlabeled power datasets from the Cori computation system at NERSC and report on the accuracy of the anomaly detection algorithms in detecting different anomalous behaviors in the amount of power consumed. The methodology is intended to aid in monitoring and automating power alerting, with the goal of achieving power efficiency and reliability in future systems deployed at NERSC or other HPC facilities.
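    The abstract does not name the detection algorithms used, so the following is only a minimal sketch of one common unsupervised approach to flagging abnormal power readings: a robust z-score based on the median absolute deviation (MAD). The function name `mad_anomalies`, the threshold of 3.5, and the sample readings are all hypothetical and are not taken from the paper or from OMNI.

```python
from statistics import median


def mad_anomalies(power_watts, threshold=3.5):
    """Return indices of readings whose robust z-score exceeds `threshold`.

    Assumption: a single univariate series of power samples (in watts).
    This is an illustrative detector, not the paper's method.
    """
    med = median(power_watts)
    abs_dev = [abs(x - med) for x in power_watts]
    mad = median(abs_dev)
    if mad == 0:
        # No spread around the median: the score is undefined, flag nothing.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a standard
    # z-score when the data are approximately normal.
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]


# Hypothetical power samples: mostly steady, one spike and one drop.
readings = [310, 305, 312, 308, 307, 311, 640, 309, 306, 95]
flagged = mad_anomalies(readings)
print(flagged)  # indices of the spike (640 W) and the drop (95 W)
```

    Because the median and MAD are insensitive to a few extreme values, this kind of detector needs no labels, which matches the unlabeled data described above; a production system would more likely score sliding windows than a whole series at once.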

    Cited By

    • (2023) Comprehensive Monitoring and Observability with Jenkins and Grafana: A Review of Integration Strategies, Best Practices, and Emerging Trends. 2023 7th International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), 1-5. DOI: 10.1109/ISMSIT58785.2023.10304904. Online publication date: 26-Oct-2023
    • (2023) Guidelines for Practicing Responsible Innovation in HPC: A Sociotechnical Approach. Distributed, Ambient and Pervasive Interactions, 105-118. DOI: 10.1007/978-3-031-34668-2_8. Online publication date: 23-Jul-2023

    Published In

    MEDES '22: Proceedings of the 14th International Conference on Management of Digital EcoSystems
    October 2022
    172 pages
    ISBN: 9781450392198
    DOI: 10.1145/3508397
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. anomaly detection
    2. big data
    3. fault tolerance
    4. high performance computing
    5. power capping
    6. power consumption

    Qualifiers

    • Research-article

    Conference

    MEDES '22

    Acceptance Rates

    Overall Acceptance Rate 267 of 682 submissions, 39%

    Article Metrics

    • Downloads (Last 12 months): 33
    • Downloads (Last 6 weeks): 1
    Reflects downloads up to 12 Aug 2024
