Abstract
High-Performance Computing (HPC) systems face a wide spectrum of I/O patterns originating from various sources, including workflows, in-situ data operations, and ad-hoc file systems. However, accurately monitoring these workloads at scale is challenging because multiple software layers interfere with system performance metrics. The metric proxy addresses this by providing real-time insights into system state while reducing overhead and storage requirements. By utilizing a Tree-Based Overlay Network (TBON) topology, it efficiently collects metrics across the nodes of an HPC system. This paper explores the conceptual foundation of the metric proxy, its architectural design, and how it can be used to improve I/O performance modelling and the detection of periodic I/O workload patterns, ultimately supporting more informed system optimization strategies.
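A minimal sketch of the TBON-style aggregation idea summarized above, written in Rust since the references point to a Rust implementation; all type and function names (MetricSample, TreeNode, reduce) are illustrative assumptions and not the metric proxy's actual API:

// Toy tree-based (TBON-style) reduction of node-local metrics.
// Leaves hold per-node samples; inner nodes aggregate their children
// before forwarding upward, so per-level traffic scales with the number
// of distinct metrics rather than the number of compute nodes.

use std::collections::HashMap;

/// One counter sampled on a compute node, e.g. bytes written to the parallel FS.
#[derive(Clone, Debug)]
struct MetricSample {
    name: String,
    value: f64,
}

/// A node in the overlay tree.
struct TreeNode {
    local: Vec<MetricSample>,
    children: Vec<TreeNode>,
}

impl TreeNode {
    /// Sum all metrics in this subtree into a single map.
    fn reduce(&self) -> HashMap<String, f64> {
        let mut acc: HashMap<String, f64> = HashMap::new();
        for s in &self.local {
            *acc.entry(s.name.clone()).or_insert(0.0) += s.value;
        }
        for child in &self.children {
            for (name, value) in child.reduce() {
                *acc.entry(name).or_insert(0.0) += value;
            }
        }
        acc
    }
}

fn main() {
    // Two leaves (compute nodes) feeding one root (the proxy side).
    let leaf = |v| TreeNode {
        local: vec![MetricSample { name: "io_write_bytes".into(), value: v }],
        children: vec![],
    };
    let root = TreeNode { local: vec![], children: vec![leaf(1.0e6), leaf(2.5e6)] };
    println!("{:?}", root.reduce()); // {"io_write_bytes": 3500000.0}
}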
References
Graphite. https://graphiteapp.org/. Accessed 27 Feb 2024
InfluxData. https://www.influxdata.com/. Accessed 27 Feb 2024
Nagios. https://www.nagios.org/. Accessed 27 Feb 2024
OpenTSDB. http://opentsdb.net/. Accessed 27 Feb 2024
Prometheus time series database. https://prometheus.io/. Accessed 27 Feb 2024
Sensu. https://sensu.io/. Accessed 27 Feb 2024
Adhianto, L., et al.: HPCTOOLKIT: tools for performance analysis of optimized parallel programs. Concurrency Comput. Pract. Experience 22(6), 685–701 (2010)
Aggarwal, V., Yoon, C., George, A., Lam, H., Stitt, G.: Performance modeling for multilevel communication in SHMEM+. In: PGAS '10: Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model. Association for Computing Machinery, New York, NY, USA (2010). https://doi.org/10.1145/2020373.2020380
Aldinucci, M., et al.: HPC4AI, an AI-on-demand federated platform endeavour. In: ACM Computing Frontiers. Ischia, Italy (2018). https://doi.org/10.1145/3203217.3205340
Betke, E., Kunkel, J.: Footprinting parallel I/O - machine learning to classify application's I/O behavior. In: High Performance Computing: ISC High Performance 2019 International Workshops, Frankfurt, Germany, June 16–20, 2019, Revised Selected Papers 34, pp. 214–226. Springer (2019)
Bhattacharyya, A., Hoefler, T.: PEMOGEN: automatic adaptive performance modeling during program runtime. In: Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT’14), pp. 393–404. ACM (2014)
Boehme, D., et al.: Caliper: performance introspection for HPC software stacks. In: SC’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 550–560. IEEE (2016)
Boito, F., Pallez, G., Teylo, L., Vidal, N.: IO-sets: simple and efficient approaches for I/O bandwidth management. IEEE Trans. Parallel Distrib. Syst. 34(10), 2783–2796 (2023)
Calotoiu, A., Hoefler, T., Poke, M., Wolf, F.: Using automated performance modeling to find scalability bugs in complex codes. In: Proceedings of the ACM/IEEE Conference on Supercomputing (SC13), Denver, CO, USA. pp. 1–12. ACM (2013). https://doi.org/10.1145/2503210.2503277
Carretero, J., et al.: Adaptive multi-tier intelligent data manager for exascale. In: Proceedings of the 20th ACM International Conference on Computing Frontiers, pp. 285–290 (2023)
Cascajo, A., Singh, D.E., Carretero, J.: LIMITLESS - light-weight monitoring tool for large scale systems. Microprocess. Microsyst. 93, 104586 (2022)
Dorier, M., Antoniu, G., Ross, R., Kimpe, D., Ibrahim, S.: CALCioM: mitigating I/O interference in HPC systems through cross-application coordination. In: IPDPS’14, pp. 155–164. IEEE (2014)
Eitzinger, J., Gruber, T., Afzal, A., Zeiser, T., Wellein, G.: Clustercockpit-a web application for job-specific performance monitoring. In: 2019 IEEE International Conference on Cluster Computing (CLUSTER), pp. 1–7. IEEE (2019)
Eller, P.R., Hoefler, T., Gropp, W.: Using performance models to understand scalable krylov solver performance at scale for structured grid problems. In: ICS ’19: Proceedings of the ACM International Conference on Supercomputing, pp. 138–149. Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3330345.3330358
Evans, T., et al.: Comprehensive resource use monitoring for HPC systems with TACC stats. In: 2014 First International Workshop on HPC User Support Tools, pp. 13–21. IEEE (2014)
Forschungszentrum Jülich, J.S.C.: LLview. https://github.com/FZJ-JSC/LLview. Accessed 30 Apr 2024
Gabriel Jr, D.J.: I/O throughput prediction for HPC applications using darshan logs. Ph.D. thesis, University of Nevada, Reno (2022)
Geimer, M., Wolf, F., Wylie, B.J.N., Ábrahám, E., Becker, D., Mohr, B.: The Scalasca performance toolset architecture. Concurrency Comput. Pract. Experience 22(6), 702–719 (2010)
Grafana (2023). https://github.com/grafana/grafana
Izadpanah, R., Naksinehaboon, N., Brandt, J., Gentile, A., Dechev, D.: Integrating low-latency analysis into HPC system monitoring. In: Proceedings of the 47th International Conference on Parallel Processing, pp. 1–10 (2018)
Jeannot, E., Pallez, G., Vidal, N.: Scheduling periodic I/O access with bi-colored chains: models and algorithms. J. Sched. 24(5), 469–481 (2021)
Knüpfer, A., et al.: Score-P: a joint performance measurement run-time infrastructure for Periscope, Scalasca, TAU, and Vampir. In: Brunst, H., Müller, M.S., Nagel, W.E., Resch, M.M. (eds.) Tools for High Performance Computing 2011, pp. 79–91. Springer, Berlin, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31476-6_7
Kunkel, J.M., et al.: Tools for analyzing parallel I/O. In: High Performance Computing: ISC High Performance 2018 International Workshops, Frankfurt/Main, Germany, June 28, 2018, Revised Selected Papers 33, pp. 49–70. Springer (2018)
Lee, C.W., Malony, A.D., Morris, A.: TAUmon: scalable online performance data analysis in TAU. In: Euro-Par 2010 Parallel Processing Workshops: HeteroPar, HPCC, HiBB, CoreGrid, UCHPC, HPCF, PROPER, CCPI, VHPC, Ischia, Italy, August 31–September 3, 2010, Revised Selected Papers 16, pp. 493–499. Springer (2011)
Marathe, A., et al.: Performance modeling under resource constraints using deep transfer learning. In: SC ’17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3126908.3126969
Massie, M.L., Chun, B.N., Culler, D.E.: The ganglia distributed monitoring system: design, implementation, and experience. Parallel Comput. 30(7), 817–840 (2004)
Matsakis, N.D., Klock II, F.S.: The Rust language. In: ACM SIGAda Ada Letters, vol. 34, pp. 103–104. ACM (2014)
Morris, A., Spear, W., Malony, A.D., Shende, S.: Observing performance dynamics using parallel profile snapshots. In: Euro-Par 2008–Parallel Processing: 14th International Euro-Par Conference, Las Palmas de Gran Canaria, Spain, August 26–29, 2008. Proceedings 14, pp. 162–171. Springer (2008)
Netti, A., et al.: From facility to application sensor data: modular, continuous and holistic monitoring with DCDB. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–27 (2019)
Netti, A., et al.: DCDB Wintermute: enabling online and holistic operational data analytics on HPC systems. In: Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing, pp. 101–112 (2020)
Nikitenko, D., et al.: JobDigest - detailed system monitoring-based supercomputer application behavior analysis. In: Russian Supercomputing Days, pp. 516–529. Springer (2017)
Obaida, M.A., Liu, J., Chennupati, G., Santhi, N., Eidenbenz, S.: Parallel application performance prediction using analysis based models and HPC simulations. In: SIGSIM-PADS ’18: Proceedings of the 2018 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, pp. 49–59. Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3200921.3200937
Price, J., McIntosh-Smith, S.: Improving auto-tuning convergence times with dynamically generated predictive performance models. In: MCSOC '15: Proceedings of the 2015 IEEE 9th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, pp. 211–218. IEEE Computer Society, USA (2015). https://doi.org/10.1109/MCSoC.2015.31
Roth, P.C., Arnold, D.C., Miller, B.P.: MRNet: a software-based multicast/reduction network for scalable tools. In: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, p. 21 (2003)
Shende, S.S., Malony, A.D.: The TAU parallel performance system. Int. J. High Perform. Comput. Appl. 20(2), 287–311 (2006). https://doi.org/10.1177/1094342006064482
Sodhi, S., Subhlok, J., Xu, Q.: Performance prediction with skeletons. Cluster Comput. 11(2), 151–165 (2008). https://doi.org/10.1007/s10586-007-0039-2
Sun, J., Sun, G., Zhan, S., Zhang, J., Chen, Y.: Automated performance modeling of HPC applications using machine learning. IEEE Trans. Comput. 69(5), 749–763 (2020). https://doi.org/10.1109/TC.2020.2964767
Tarraf, A., Bandet, A., Boito, F., Pallez, G., Wolf, F.: Capturing periodic I/O using frequency techniques. In: Proceedings of the 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS), San Francisco, CA, USA, pp. 1–14. IEEE (2024)
Vef, M.A., et al.: GekkoFS - a temporary distributed file system for HPC applications. In: 2018 IEEE International Conference on Cluster Computing (CLUSTER), pp. 319–324 (2018). https://doi.org/10.1109/CLUSTER.2018.00049
Weiss, A., Gierczak, O., Patterson, D., Ahmed, A.: Oxide: the essence of rust. arXiv preprint arXiv:1903.00982 (2019)
Yang, W., Liao, X., Dong, D., Yu, J.: A quantitative study of the spatiotemporal I/O burstiness of HPC application. In: 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1349–1359 (2022). https://doi.org/10.1109/IPDPS53621.2022.00133
Acknowledgment
We acknowledge the support of the European Commission and the German Federal Ministry of Education and Research (BMBF) under the EuroHPC project ADMIRE (GA no. 956748, BMBF funding no. 16HPC006K), which receives support from the European Union's Horizon 2020 programme and DE, FR, ES, IT, PL, and SE. This work was also funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - Project No. 449683531 (ExtraNoise). Moreover, the authors gratefully acknowledge the computing time provided to them on the high-performance computer at the University of Turin by the laboratory on High-Performance Computing for Artificial Intelligence [9].
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Besnard, JB., Tarraf, A., Cascajo, A., Shende, S. (2025). Introducing the Metric Proxy for Holistic I/O Measurements. In: Weiland, M., Neuwirth, S., Kruse, C., Weinzierl, T. (eds) High Performance Computing. ISC High Performance 2024 International Workshops. ISC High Performance 2023. Lecture Notes in Computer Science, vol 15058. Springer, Cham. https://doi.org/10.1007/978-3-031-73716-9_15
DOI: https://doi.org/10.1007/978-3-031-73716-9_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73715-2
Online ISBN: 978-3-031-73716-9
eBook Packages: Computer Science, Computer Science (R0)