Introducing the Metric Proxy for Holistic I/O Measurements

  • Conference paper
High Performance Computing. ISC High Performance 2024 International Workshops (ISC High Performance 2023)

Abstract

High-Performance Computing (HPC) systems face a wide spectrum of I/O patterns from various sources, including workflows, in situ data operations, and ad hoc file systems. However, accurately monitoring these workloads at scale is challenging because multiple software layers interfere with system performance metrics. The metric proxy addresses this by providing real-time insight into the system state while reducing measurement overhead and storage requirements. By using a Tree-Based Overlay Network (TBON) topology, it efficiently collects metrics across the nodes of an HPC system. This paper explores the conceptual foundation of the metric proxy, the design of its architecture, and how it can be used to improve I/O performance modelling and the detection of periodic I/O workload patterns, ultimately aiding more informed system optimization strategies.
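For illustration, the sketch below shows how a tree-based overlay could merge per-node counters into a single system-wide view, with each level emitting one compact map instead of forwarding every sample. It is a minimal sketch written here in Rust; the TbonNode type, the io_bytes_written counter name, and the plain sum reduction are assumptions made for this example, not the proxy_v2 implementation.

    // Minimal TBON-style aggregation sketch (assumed types, not from proxy_v2).
    use std::collections::HashMap;

    /// One node of the tree-based overlay network (TBON): it holds locally
    /// sampled metrics plus the children whose aggregates it forwards upward.
    struct TbonNode {
        local_metrics: HashMap<String, f64>,
        children: Vec<TbonNode>,
    }

    impl TbonNode {
        /// Reduce this subtree: merge every child's aggregate into a copy of
        /// the local metrics, so each level passes up a single map instead of
        /// forwarding every individual sample.
        fn reduce(&self) -> HashMap<String, f64> {
            let mut acc = self.local_metrics.clone();
            for child in &self.children {
                for (name, value) in child.reduce() {
                    *acc.entry(name).or_insert(0.0) += value;
                }
            }
            acc
        }
    }

    fn main() {
        // Two compute nodes reporting a hypothetical I/O counter to one root.
        let leaf = |bytes: f64| TbonNode {
            local_metrics: HashMap::from([("io_bytes_written".to_string(), bytes)]),
            children: Vec::new(),
        };
        let root = TbonNode {
            local_metrics: HashMap::new(),
            children: vec![leaf(1024.0), leaf(2048.0)],
        };
        // The root obtains the system-wide aggregate without polling every node.
        println!("{:?}", root.reduce());
    }

In a real deployment each level would forward its reduced map over the network rather than holding its children in memory, which bounds the fan-in at every level and spares the root from contacting every compute node directly.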

Notes

  1. https://github.com/besnardjb/proxy_v2.

  2. https://github.com/besnardjb/beegfs-exporter.

  3. https://github.com/tuda-parallel/FTIO.

References

  1. Graphite. https://graphiteapp.org/. Accessed 27 Feb 2024

  2. InfluxData. https://www.influxdata.com/. Accessed 27 Feb 2024

  3. Nagios. https://www.nagios.org/. Accessed 27 Feb 2024

  4. OpenTSDB. http://opentsdb.net/. Accessed 27 Feb 2024

  5. Prometheus time series database. https://prometheus.io/. Accessed 27 Feb 2024

  6. Sensu. https://sensu.io/. Accessed 27 Feb 2024

  7. Adhianto, L., et al.: HPCTOOLKIT: tools for performance analysis of optimized parallel programs. Concurrency Comput. Pract. Experience 22(6), 685–701 (2010)

  8. Aggarwal, V., Yoon, C., George, A., Lam, H., Stitt, G.: Performance modeling for multilevel communication in SHMEM+. In: PGAS ’10: Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model. Association for Computing Machinery, New York, NY, USA (2010). https://doi.org/10.1145/2020373.2020380

  9. Aldinucci, M., et al.: HPC4AI, an AI-on-demand federated platform endeavour. In: ACM Computing Frontiers. Ischia, Italy (2018). https://doi.org/10.1145/3203217.3205340

  10. Betke, E., Kunkel, J.: Footprinting parallel I/O - machine learning to classify application’s I/O behavior. In: High Performance Computing: ISC High Performance 2019 International Workshops, Frankfurt, Germany, June 16–20, 2019, Revised Selected Papers 34, pp. 214–226. Springer (2019)

  11. Bhattacharyya, A., Hoefler, T.: PEMOGEN: automatic adaptive performance modeling during program runtime. In: Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT’14), pp. 393–404. ACM (2014)

  12. Boehme, D., et al.: Caliper: performance introspection for HPC software stacks. In: SC’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 550–560. IEEE (2016)

  13. Boito, F., Pallez, G., Teylo, L., Vidal, N.: IO-sets: simple and efficient approaches for I/O bandwidth management. IEEE Trans. Parallel Distrib. Syst. 34(10), 2783–2796 (2023)

  14. Calotoiu, A., Hoefler, T., Poke, M., Wolf, F.: Using automated performance modeling to find scalability bugs in complex codes. In: Proceedings of the ACM/IEEE Conference on Supercomputing (SC13), Denver, CO, USA. pp. 1–12. ACM (2013). https://doi.org/10.1145/2503210.2503277

  15. Carretero, J., et al.: Adaptive multi-tier intelligent data manager for exascale. In: Proceedings of the 20th ACM International Conference on Computing Frontiers, pp. 285–290 (2023)

  16. Cascajo, A., Singh, D.E., Carretero, J.: LIMITLESS - light-weight monitoring tool for large scale systems. Microprocess. Microsyst. 93, 104586 (2022)

  17. Dorier, M., Antoniu, G., Ross, R., Kimpe, D., Ibrahim, S.: CALCioM: mitigating I/O interference in HPC systems through cross-application coordination. In: IPDPS’14, pp. 155–164. IEEE (2014)

  18. Eitzinger, J., Gruber, T., Afzal, A., Zeiser, T., Wellein, G.: ClusterCockpit - a web application for job-specific performance monitoring. In: 2019 IEEE International Conference on Cluster Computing (CLUSTER), pp. 1–7. IEEE (2019)

  19. Eller, P.R., Hoefler, T., Gropp, W.: Using performance models to understand scalable Krylov solver performance at scale for structured grid problems. In: ICS ’19: Proceedings of the ACM International Conference on Supercomputing, pp. 138–149. Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3330345.3330358

  20. Evans, T., et al.: Comprehensive resource use monitoring for HPC systems with TACC stats. In: 2014 First International Workshop on HPC User Support Tools, pp. 13–21. IEEE (2014)

  21. Forschungszentrum Jülich, J.S.C.: LLview. https://github.com/FZJ-JSC/LLview. Accessed 30 Apr 2024

  22. Gabriel Jr, D.J.: I/O throughput prediction for HPC applications using Darshan logs. Ph.D. thesis, University of Nevada, Reno (2022)

  23. Geimer, M., Wolf, F., Wylie, B.J.N., Ábrahám, E., Becker, D., Mohr, B.: The Scalasca performance toolset architecture. Concurrency Comput. Pract. Experience 22(6), 702–719 (2010)

  24. Grafana (2023). https://github.com/grafana/grafana

  25. Izadpanah, R., Naksinehaboon, N., Brandt, J., Gentile, A., Dechev, D.: Integrating low-latency analysis into HPC system monitoring. In: Proceedings of the 47th International Conference on Parallel Processing, pp. 1–10 (2018)

  26. Jeannot, E., Pallez, G., Vidal, N.: Scheduling periodic I/O access with bi-colored chains: models and algorithms. J. Sched. 24(5), 469–481 (2021)

  27. Knüpfer, A., et al.: Score-P: a joint performance measurement run-time infrastructure for Periscope, Scalasca, TAU, and Vampir. In: Brunst, H., Müller, M.S., Nagel, W.E., Resch, M.M. (eds.) Tools for High Performance Computing 2011, pp. 79–91. Springer, Berlin, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31476-6_7

  28. Kunkel, J.M., et al.: Tools for analyzing parallel I/O. In: High Performance Computing: ISC High Performance 2018 International Workshops, Frankfurt/Main, Germany, June 28, 2018, Revised Selected Papers 33, pp. 49–70. Springer (2018)

  29. Lee, C.W., Malony, A.D., Morris, A.: TAUmon: scalable online performance data analysis in TAU. In: Euro-Par 2010 Parallel Processing Workshops: HeteroPar, HPCC, HiBB, CoreGrid, UCHPC, HPCF, PROPER, CCPI, VHPC, Ischia, Italy, August 31–September 3, 2010, Revised Selected Papers 16, pp. 493–499. Springer (2011)

  30. Marathe, A., et al.: Performance modeling under resource constraints using deep transfer learning. In: SC ’17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3126908.3126969

  31. Massie, M.L., Chun, B.N., Culler, D.E.: The ganglia distributed monitoring system: design, implementation, and experience. Parallel Comput. 30(7), 817–840 (2004)

  32. Matsakis, N.D., Klock II, F.S.: The Rust language. In: ACM SIGAda Ada Letters, vol. 34, pp. 103–104. ACM (2014)

  33. Morris, A., Spear, W., Malony, A.D., Shende, S.: Observing performance dynamics using parallel profile snapshots. In: Euro-Par 2008–Parallel Processing: 14th International Euro-Par Conference, Las Palmas de Gran Canaria, Spain, August 26–29, 2008. Proceedings 14, pp. 162–171. Springer (2008)

  34. Netti, A., et al.: From facility to application sensor data: modular, continuous and holistic monitoring with DCDB. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–27 (2019)

  35. Netti, A., et al.: DCDB Wintermute: enabling online and holistic operational data analytics on HPC systems. In: Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing, pp. 101–112 (2020)

  36. Nikitenko, D., et al.: JobDigest - detailed system monitoring-based supercomputer application behavior analysis. In: Russian Supercomputing Days, pp. 516–529. Springer (2017)

  37. Obaida, M.A., Liu, J., Chennupati, G., Santhi, N., Eidenbenz, S.: Parallel application performance prediction using analysis based models and HPC simulations. In: SIGSIM-PADS ’18: Proceedings of the 2018 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, pp. 49–59. Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3200921.3200937

  38. Price, J., McIntosh-Smith, S.: Improving auto-tuning convergence times with dynamically generated predictive performance models. In: MCSOC ’15: Proceedings of the 2015 IEEE 9th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, pp. 211–218. IEEE Computer Society, USA (2015). https://doi.org/10.1109/MCSoC.2015.31

  39. Roth, P.C., Arnold, D.C., Miller, B.P.: MRNet: a software-based multicast/reduction network for scalable tools. In: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, p. 21 (2003)

  40. Shende, S.S., Malony, A.D.: The TAU parallel performance system. Int. J. High Perform. Comput. Appl. 20(2), 287–311 (2006). https://doi.org/10.1177/1094342006064482

  41. Sodhi, S., Subhlok, J., Xu, Q.: Performance prediction with skeletons. Cluster Comput. 11(2), 151–165 (2008). https://doi.org/10.1007/s10586-007-0039-2

  42. Sun, J., Sun, G., Zhan, S., Zhang, J., Chen, Y.: Automated performance modeling of HPC applications using machine learning. IEEE Trans. Comput. 69(5), 749–763 (2020). https://doi.org/10.1109/TC.2020.2964767

  43. Tarraf, A., Bandet, A., Boito, F., Pallez, G., Wolf, F.: Capturing periodic I/O using frequency techniques. In: Proceedings of the 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS), San Francisco, CA, USA, pp. 1–14. IEEE (2024)

  44. Vef, M.A., et al.: GekkoFS - a temporary distributed file system for HPC applications. In: 2018 IEEE International Conference on Cluster Computing (CLUSTER), pp. 319–324 (2018). https://doi.org/10.1109/CLUSTER.2018.00049

  45. Weiss, A., Gierczak, O., Patterson, D., Ahmed, A.: Oxide: the essence of rust. arXiv preprint arXiv:1903.00982 (2019)

  46. Yang, W., Liao, X., Dong, D., Yu, J.: A quantitative study of the spatiotemporal I/O burstiness of HPC application. In: 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1349–1359 (2022). https://doi.org/10.1109/IPDPS53621.2022.00133

Acknowledgment

We acknowledge the support of the European Commission and the German Federal Ministry of Education and Research (BMBF) under the EuroHPC project ADMIRE (GA no. 956748, BMBF funding no. 16HPC006K), which receives support from the European Union’s Horizon 2020 programme and from DE, FR, ES, IT, PL, and SE. This work was also funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), Project No. 449683531 (ExtraNoise). Moreover, the authors gratefully acknowledge the computing time provided on the high-performance computer at the University of Turin by the laboratory on High-Performance Computing for Artificial Intelligence [9].

Author information

Corresponding author

Correspondence to Jean-Baptiste Besnard.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Besnard, JB., Tarraf, A., Cascajo, A., Shende, S. (2025). Introducing the Metric Proxy for Holistic I/O Measurements. In: Weiland, M., Neuwirth, S., Kruse, C., Weinzierl, T. (eds) High Performance Computing. ISC High Performance 2024 International Workshops. ISC High Performance 2023. Lecture Notes in Computer Science, vol 15058. Springer, Cham. https://doi.org/10.1007/978-3-031-73716-9_15

  • DOI: https://doi.org/10.1007/978-3-031-73716-9_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73715-2

  • Online ISBN: 978-3-031-73716-9

  • eBook Packages: Computer Science, Computer Science (R0)
