Abstract
High-Performance Computing (HPC) systems face a wide spectrum of I/O patterns originating from various sources, including workflows, in-situ data operations, and ad-hoc file systems. However, accurately monitoring these workloads at scale is challenging because multiple software layers interfere with system performance metrics. The metric proxy addresses this by providing real-time insights into system state while reducing overhead and storage requirements. By utilizing a Tree-Based Overlay Network (TBON) topology, it efficiently collects metrics across the nodes of an HPC system. This paper explores the conceptual foundation of the metric proxy, its architectural design, and how it can be used to improve I/O performance modelling and the detection of periodic I/O workload patterns, ultimately supporting more informed system optimization strategies.
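A minimal sketch of the TBON-style aggregation idea summarized above, written in Rust since the references point to a Rust implementation; all type and function names (MetricSample, TreeNode, reduce) are illustrative assumptions and not the metric proxy's actual API:

// Toy tree-based (TBON-style) reduction of node-local metrics.
// Leaves hold per-node samples; inner nodes aggregate their children
// before forwarding upward, so per-level traffic scales with the number
// of distinct metrics rather than the number of compute nodes.

use std::collections::HashMap;

/// One counter sampled on a compute node, e.g. bytes written to the parallel FS.
#[derive(Clone, Debug)]
struct MetricSample {
    name: String,
    value: f64,
}

/// A node in the overlay tree.
struct TreeNode {
    local: Vec<MetricSample>,
    children: Vec<TreeNode>,
}

impl TreeNode {
    /// Sum all metrics in this subtree into a single map.
    fn reduce(&self) -> HashMap<String, f64> {
        let mut acc: HashMap<String, f64> = HashMap::new();
        for s in &self.local {
            *acc.entry(s.name.clone()).or_insert(0.0) += s.value;
        }
        for child in &self.children {
            for (name, value) in child.reduce() {
                *acc.entry(name).or_insert(0.0) += value;
            }
        }
        acc
    }
}

fn main() {
    // Two leaves (compute nodes) feeding one root (the proxy side).
    let leaf = |v| TreeNode {
        local: vec![MetricSample { name: "io_write_bytes".into(), value: v }],
        children: vec![],
    };
    let root = TreeNode { local: vec![], children: vec![leaf(1.0e6), leaf(2.5e6)] };
    println!("{:?}", root.reduce()); // {"io_write_bytes": 3500000.0}
}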
References
Graphite. https://graphiteapp.org/. Accessed 27 Feb 2024
InfluxData. https://www.influxdata.com/. Accessed 27 Feb 2024
Nagios. https://www.nagios.org/. Accessed 27 Feb 2024
OpenTSDB. http://opentsdb.net/. Accessed 27 Feb 2024
Prometheus time series database. https://prometheus.io/. Accessed 27 Feb 2024
Sensu. https://sensu.io/. Accessed 27 Feb 2024
Adhianto, L., et al.: HPCTOOLKIT: tools for performance analysis of optimized parallel programs. Concurrency Comput. Pract. Experience 22(6), 685–701 (2010)
Aggarwal, V., Yoon, C., George, A., Lam, H., Stitt, G.: Performance modeling for multilevel communication in SHMEM+. In: PGAS '10: Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model. Association for Computing Machinery, New York, NY, USA (2010). https://doi.org/10.1145/2020373.2020380
Aldinucci, M., et al.: HPC4AI, an AI-on-demand federated platform endeavour. In: ACM Computing Frontiers. Ischia, Italy (2018). https://doi.org/10.1145/3203217.3205340
Betke, E., Kunkel, J.: Footprinting parallel I/O - machine learning to classify application's I/O behavior. In: High Performance Computing: ISC High Performance 2019 International Workshops, Frankfurt, Germany, June 16–20, 2019, Revised Selected Papers 34, pp. 214–226. Springer (2019)
Bhattacharyya, A., Hoefler, T.: PEMOGEN: automatic adaptive performance modeling during program runtime. In: Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT’14), pp. 393–404. ACM (2014)
Boehme, D., et al.: Caliper: performance introspection for HPC software stacks. In: SC’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 550–560. IEEE (2016)
Boito, F., Pallez, G., Teylo, L., Vidal, N.: IO-sets: simple and efficient approaches for I/O bandwidth management. IEEE Trans. Parallel Distrib. Syst. 34(10), 2783–2796 (2023)
Calotoiu, A., Hoefler, T., Poke, M., Wolf, F.: Using automated performance modeling to find scalability bugs in complex codes. In: Proceedings of the ACM/IEEE Conference on Supercomputing (SC13), Denver, CO, USA. pp. 1–12. ACM (2013). https://doi.org/10.1145/2503210.2503277
Carretero, J., et al.: Adaptive multi-tier intelligent data manager for exascale. In: Proceedings of the 20th ACM International Conference on Computing Frontiers, pp. 285–290 (2023)
Cascajo, A., Singh, D.E., Carretero, J.: LIMITLESS - light-weight monitoring tool for large scale systems. Microprocess. Microsyst. 93, 104586 (2022)
Dorier, M., Antoniu, G., Ross, R., Kimpe, D., Ibrahim, S.: CALCioM: mitigating I/O interference in HPC systems through cross-application coordination. In: IPDPS’14, pp. 155–164. IEEE (2014)
Eitzinger, J., Gruber, T., Afzal, A., Zeiser, T., Wellein, G.: Clustercockpit-a web application for job-specific performance monitoring. In: 2019 IEEE International Conference on Cluster Computing (CLUSTER), pp. 1–7. IEEE (2019)
Eller, P.R., Hoefler, T., Gropp, W.: Using performance models to understand scalable krylov solver performance at scale for structured grid problems. In: ICS ’19: Proceedings of the ACM International Conference on Supercomputing, pp. 138–149. Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3330345.3330358
Evans, T., et al.: Comprehensive resource use monitoring for HPC systems with TACC stats. In: 2014 First International Workshop on HPC User Support Tools, pp. 13–21. IEEE (2014)
Forschungszentrum Jülich, J.S.C.: LLview. https://github.com/FZJ-JSC/LLview. Accessed 30 Apr 2024
Gabriel Jr, D.J.: I/O throughput prediction for HPC applications using darshan logs. Ph.D. thesis, University of Nevada, Reno (2022)
Geimer, M., Wolf, F., Wylie, B.J.N., Ábrahám, E., Becker, D., Mohr, B.: The Scalasca performance toolset architecture. Concurrency Comput. Pract. Experience 22(6), 702–719 (2010)
Grafana (2023). https://github.com/grafana/grafana
Izadpanah, R., Naksinehaboon, N., Brandt, J., Gentile, A., Dechev, D.: Integrating low-latency analysis into HPC system monitoring. In: Proceedings of the 47th International Conference on Parallel Processing, pp. 1–10 (2018)
Jeannot, E., Pallez, G., Vidal, N.: Scheduling periodic I/O access with bi-colored chains: models and algorithms. J. Sched. 24(5), 469–481 (2021)
Knüpfer, A., et al.: Score-P: a joint performance measurement run-time infrastructure for Periscope, Scalasca, TAU, and Vampir. In: Brunst, H., Müller, M.S., Nagel, W.E., Resch, M.M. (eds.) Tools for High Performance Computing 2011, pp. 79–91. Springer, Berlin, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31476-6_7
Kunkel, J.M., et al.: Tools for analyzing parallel I/O. In: High Performance Computing: ISC High Performance 2018 International Workshops, Frankfurt/Main, Germany, June 28, 2018, Revised Selected Papers 33, pp. 49–70. Springer (2018)
Lee, C.W., Malony, A.D., Morris, A.: TAUmon: scalable online performance data analysis in TAU. In: Euro-Par 2010 Parallel Processing Workshops: HeteroPar, HPCC, HiBB, CoreGrid, UCHPC, HPCF, PROPER, CCPI, VHPC, Ischia, Italy, August 31–September 3, 2010, Revised Selected Papers 16, pp. 493–499. Springer (2011)
Marathe, A., et al.: Performance modeling under resource constraints using deep transfer learning. In: SC ’17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3126908.3126969
Massie, M.L., Chun, B.N., Culler, D.E.: The ganglia distributed monitoring system: design, implementation, and experience. Parallel Comput. 30(7), 817–840 (2004)
Matsakis, N.D., Klock II, F.S.: The Rust language. In: ACM SIGAda Ada Letters, vol. 34, pp. 103–104. ACM (2014)
Morris, A., Spear, W., Malony, A.D., Shende, S.: Observing performance dynamics using parallel profile snapshots. In: Euro-Par 2008–Parallel Processing: 14th International Euro-Par Conference, Las Palmas de Gran Canaria, Spain, August 26–29, 2008. Proceedings 14, pp. 162–171. Springer (2008)
Netti, A., et al.: From facility to application sensor data: modular, continuous and holistic monitoring with DCDB. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–27 (2019)
Netti, A., et al.: DCDB Wintermute: enabling online and holistic operational data analytics on HPC systems. In: Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing, pp. 101–112 (2020)
Nikitenko, D., et al.: JobDigest - detailed system monitoring-based supercomputer application behavior analysis. In: Russian Supercomputing Days, pp. 516–529. Springer (2017)
Obaida, M.A., Liu, J., Chennupati, G., Santhi, N., Eidenbenz, S.: Parallel application performance prediction using analysis based models and HPC simulations. In: SIGSIM-PADS ’18: Proceedings of the 2018 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, pp. 49–59. Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3200921.3200937
Price, J., McIntosh-Smith, S.: Improving auto-tuning convergence times with dynamically generated predictive performance models. In: MCSOC '15: Proceedings of the 2015 IEEE 9th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, pp. 211–218. IEEE Computer Society, USA (2015). https://doi.org/10.1109/MCSoC.2015.31
Roth, P.C., Arnold, D.C., Miller, B.P.: MRNet: a software-based multicast/reduction network for scalable tools. In: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, p. 21 (2003)
Shende, S.S., Malony, A.D.: The TAU parallel performance system. Int. J. High Perform. Comput. Appl. 20(2), 287–311 (2006). https://doi.org/10.1177/1094342006064482
Sodhi, S., Subhlok, J., Xu, Q.: Performance prediction with skeletons. Cluster Comput. 11(2), 151–165 (2008). https://doi.org/10.1007/s10586-007-0039-2
Sun, J., Sun, G., Zhan, S., Zhang, J., Chen, Y.: Automated performance modeling of HPC applications using machine learning. IEEE Trans. Comput. 69(5), 749–763 (2020). https://doi.org/10.1109/TC.2020.2964767
Tarraf, A., Bandet, A., Boito, F., Pallez, G., Wolf, F.: Capturing periodic I/O using frequency techniques. In: Proceedings of the 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS), San Francisco, CA, USA, pp. 1–14. IEEE (2024)
Vef, M.A., et al.: GekkoFS - a temporary distributed file system for HPC applications. In: 2018 IEEE International Conference on Cluster Computing (CLUSTER), pp. 319–324 (2018). https://doi.org/10.1109/CLUSTER.2018.00049
Weiss, A., Gierczak, O., Patterson, D., Ahmed, A.: Oxide: the essence of rust. arXiv preprint arXiv:1903.00982 (2019)
Yang, W., Liao, X., Dong, D., Yu, J.: A quantitative study of the spatiotemporal I/O burstiness of HPC application. In: 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1349–1359 (2022). https://doi.org/10.1109/IPDPS53621.2022.00133
Acknowledgment
We acknowledge the support of the European Commission and the German Federal Ministry of Education and Research (BMBF) under the EuroHPC project ADMIRE (GA no. 956748, BMBF funding no. 16HPC006K), which receives support from the European Union's Horizon 2020 programme and DE, FR, ES, IT, PL, and SE. This work was also funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - Project No. 449683531 (ExtraNoise). Moreover, the authors gratefully acknowledge the computing time provided to them on the high-performance computer at the University of Turin by the laboratory on High-Performance Computing for Artificial Intelligence [9].
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Besnard, JB., Tarraf, A., Cascajo, A., Shende, S. (2025). Introducing the Metric Proxy for Holistic I/O Measurements. In: Weiland, M., Neuwirth, S., Kruse, C., Weinzierl, T. (eds) High Performance Computing. ISC High Performance 2024 International Workshops. ISC High Performance 2023. Lecture Notes in Computer Science, vol 15058. Springer, Cham. https://doi.org/10.1007/978-3-031-73716-9_15
DOI: https://doi.org/10.1007/978-3-031-73716-9_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73715-2
Online ISBN: 978-3-031-73716-9
eBook Packages: Computer Science, Computer Science (R0)