DOI: 10.1145/3149393.3149395

UMAMI: a recipe for generating meaningful metrics through holistic I/O performance analysis

Published: 12 November 2017

Abstract

I/O efficiency is essential to productivity in scientific computing, especially as many scientific domains become more data-intensive. Many characterization tools have been used to elucidate specific aspects of parallel I/O performance, but analyzing components of complex I/O subsystems in isolation fails to provide insight into critical questions: how do the I/O components interact, what are reasonable expectations for application performance, and what are the underlying causes of I/O performance problems? To address these questions while capitalizing on existing component-level characterization tools, we propose an approach that combines on-demand, modular synthesis of I/O characterization data into a unified monitoring and metrics interface (UMAMI) to provide a normalized, holistic view of I/O behavior.
We evaluate the feasibility of this approach by applying it to a month-long benchmarking study on two distinct large-scale computing platforms. We present three case studies that highlight the importance of analyzing application I/O performance in context with both contemporaneous and historical component metrics, and we provide new insights into the factors affecting I/O performance. By demonstrating the generality of our approach, we lay the groundwork for a production-grade framework for holistic I/O analysis.
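
To illustrate what viewing a measurement "in context with ... historical component metrics" could look like, the following minimal Python sketch classifies a new measurement by where it falls relative to the quartiles of earlier measurements of the same metric. The abstract does not say how UMAMI normalizes or compares metrics, so the quartile rule, the function name `classify_against_history`, the metric names, and all sample values are assumptions made purely for illustration.

```python
from statistics import quantiles


def classify_against_history(current, history, higher_is_better=True):
    """Place `current` relative to the quartiles of `history`.

    Hypothetical helper, not taken from the paper: values inside the
    interquartile range are "typical"; values outside it are labeled
    better or worse depending on the metric's direction.
    """
    if len(history) < 4:
        raise ValueError("need at least four historical samples")
    q1, _median, q3 = quantiles(history, n=4)
    if q1 <= current <= q3:
        return "typical"
    is_high = current > q3
    # A high value is good only for metrics where larger is better.
    return "better than usual" if is_high == higher_is_better else "worse than usual"


# Made-up history for two hypothetical metrics of one benchmark.
history = {
    "app_bandwidth_gibs": [10.2, 11.0, 9.8, 10.5, 10.9, 10.1, 9.9, 10.7],
    "server_cpu_load": [0.30, 0.35, 0.28, 0.33, 0.31, 0.29, 0.34, 0.32],
}
# Made-up measurements for the job being examined.
today = {"app_bandwidth_gibs": 7.5, "server_cpu_load": 0.60}
# Direction of each metric: bandwidth should be high, load should be low.
higher_is_better = {"app_bandwidth_gibs": True, "server_cpu_load": False}

report = {
    name: classify_against_history(today[name], history[name], higher_is_better[name])
    for name in today
}
for name, verdict in report.items():
    print(f"{name}: {verdict}")
```

With these sample values both metrics come out as "worse than usual", which is the kind of side-by-side reading (low application bandwidth coinciding with high server load) that a unified view is meant to make visible. A real deployment would draw the history from the component-level tools rather than hard-coded lists.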

Published In

PDSW-DISCS '17: Proceedings of the 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems
November 2017
74 pages
ISBN: 9781450351348
DOI: 10.1145/3149393
© 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Funding Sources

  • U.S. Department of Energy, Office of Science

Conference

SC '17

Acceptance Rates

Overall Acceptance Rate 17 of 41 submissions, 41%

Article Metrics

  • Downloads (last 12 months): 16
  • Downloads (last 6 weeks): 1
Reflects downloads up to 26 Sep 2024.

Cited By

  • (2024) ION: Navigating the HPC I/O Optimization Journey using Large Language Models. Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems, 86-92. DOI: 10.1145/3655038.3665950. Online publication date: 8-Jul-2024
  • (2024) Tarazu: An Adaptive End-to-end I/O Load-balancing Framework for Large-scale Parallel File Systems. ACM Transactions on Storage 20:2, 1-42. DOI: 10.1145/3641885. Online publication date: 1-Feb-2024
  • (2024) Mobilizing underutilized storage nodes via job path: A job-aware file striping approach. Parallel Computing, 103095. DOI: 10.1016/j.parco.2024.103095. Online publication date: Aug-2024
  • (2023) IOMax: Maximizing Out-of-Core I/O Analysis Performance on HPC Systems. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, 1209-1215. DOI: 10.1145/3624062.3624191. Online publication date: 12-Nov-2023
  • (2023) Autonomy Loops for Monitoring, Operational Data Analytics, Feedback, and Response in HPC Operations. 2023 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops), 37-43. DOI: 10.1109/CLUSTERWorkshops61457.2023.00016. Online publication date: 31-Oct-2023
  • (2023) Optimizing HPC I/O Performance with Regression Analysis and Ensemble Learning. 2023 IEEE International Conference on Cluster Computing (CLUSTER), 234-246. DOI: 10.1109/CLUSTER52292.2023.00027. Online publication date: 31-Oct-2023
  • (2023) Illuminating the I/O Optimization Path of Scientific Applications. High Performance Computing, 22-41. DOI: 10.1007/978-3-031-32041-5_2. Online publication date: 10-May-2023
  • (2022) SLRL. Proceedings of the Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems, 33-39. DOI: 10.1145/3503646.3524297. Online publication date: 5-Apr-2022
  • (2022) A Taxonomy of Error Sources in HPC I/O Machine Learning Models. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 01-14. DOI: 10.1109/SC41404.2022.00021. Online publication date: Nov-2022
  • (2022) Drishti: Guiding End-Users in the I/O Optimization Journey. 2022 IEEE/ACM International Parallel Data Systems Workshop (PDSW), 1-6. DOI: 10.1109/PDSW56643.2022.00006. Online publication date: Nov-2022
