research-article

UMAMI: a recipe for generating meaningful metrics through holistic I/O performance analysis

Authors: Glenn K. Lockwood, Nicholas J. Wright, Philip Carns

PDSW-DISCS '17: Proceedings of the 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems

Pages 55 - 60

https://doi.org/10.1145/3149393.3149395

Published: 12 November 2017

Abstract

I/O efficiency is essential to productivity in scientific computing, especially as many scientific domains become more data-intensive. Many characterization tools have been used to elucidate specific aspects of parallel I/O performance, but analyzing components of complex I/O subsystems in isolation fails to provide insight into critical questions: how do the I/O components interact, what are reasonable expectations for application performance, and what are the underlying causes of I/O performance problems? To address these questions while capitalizing on existing component-level characterization tools, we propose an approach that combines on-demand, modular synthesis of I/O characterization data into a unified monitoring and metrics interface (UMAMI) to provide a normalized, holistic view of I/O behavior.

We evaluate the feasibility of this approach by applying it to a month-long benchmarking study on two distinct large-scale computing platforms. We present three case studies that highlight the importance of analyzing application I/O performance in context with both contemporaneous and historical component metrics, and we provide new insights into the factors affecting I/O performance. By demonstrating the generality of our approach, we lay the groundwork for a production-grade framework for holistic I/O analysis.
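
As a rough illustration of what viewing a measurement "in context with historical metrics" could look like, the sketch below compares the latest value of each metric with the quartiles of its past values. This is not the authors' implementation: the function `place_in_context`, the metric names, and every number are hypothetical and chosen only for illustration.

```python
from statistics import quantiles


def place_in_context(history, latest):
    """Compare the latest measurement with the quartiles of past measurements."""
    # "inclusive" treats the history as the full population of observations.
    q1, median, q3 = quantiles(history, n=4, method="inclusive")
    if latest < q1:
        verdict = "below typical range"
    elif latest > q3:
        verdict = "above typical range"
    else:
        verdict = "within typical range"
    return {"q1": q1, "median": median, "q3": q3,
            "latest": latest, "verdict": verdict}


# Hypothetical daily measurements; names and values are illustrative only.
history = {
    "app_bandwidth_gibs": [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 12.0],
    "server_cpu_load_pct": [20.0, 25.0, 22.0, 30.0, 24.0, 21.0, 23.0],
}
latest = {"app_bandwidth_gibs": 6.0, "server_cpu_load_pct": 55.0}

# One row per metric, so application and component metrics sit side by side.
report = {name: place_in_context(history[name], latest[name])
          for name in history}

for name, row in report.items():
    print(f"{name}: latest={row['latest']} median={row['median']} "
          f"-> {row['verdict']}")
```

Placing an application-level metric next to a storage-component metric in the same report is the point of the sketch: an unusually low bandwidth reading is easier to interpret when a contemporaneous component metric is also outside its usual range.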


Information & Contributors

Information

Published In

PDSW-DISCS '17: Proceedings of the 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems

November 2017

74 pages

ISBN:9781450351348

DOI:10.1145/3149393

Program Chairs: Kathryn Mohror (Lawrence Livermore National Laboratory), Brent Welch (Google)

Copyright © 2017 ACM.

© 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

SIGHPC: ACM Special Interest Group on High Performance Computing
IEEE-CS\DATC: IEEE Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

U.S. Department of Energy, Office of Science

Conference

SC '17: The International Conference for High Performance Computing, Networking, Storage and Analysis

Sponsors: SIGHPC, IEEE-CS\DATC

November 12 - 17, 2017

Denver, Colorado

Acceptance Rates

Overall Acceptance Rate 17 of 41 submissions, 41%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 30
Total Downloads: 281
Downloads (Last 12 months): 16
Downloads (Last 6 weeks): 1

Reflects downloads up to 26 Sep 2024

Cited By

Egersdoerfer C, Sareen A, Bez J, Byna S, Dai D. (2024). ION: Navigating the HPC I/O Optimization Journey using Large Language Models. Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems, 86-92. Online publication date: 8-Jul-2024.
https://dl.acm.org/doi/10.1145/3655038.3665950

Paul A, Neuwirth S, Wadhwa B, Wang F, Oral S, Butt A. (2024). Tarazu: An Adaptive End-to-end I/O Load-balancing Framework for Large-scale Parallel File Systems. ACM Transactions on Storage 20:2, 1-42. Online publication date: 1-Feb-2024.
https://dl.acm.org/doi/10.1145/3641885

Xian G, Yang W, Tan Y, Feng J, Li Y, Zhang J, Yu J. (2024). Mobilizing underutilized storage nodes via job path: A job-aware file striping approach. Parallel Computing, 103095. Online publication date: Aug-2024.
https://doi.org/10.1016/j.parco.2024.103095

Yildirim I, Devarajan H, Kougkas A, Sun X, Mohror K. (2023). IOMax: Maximizing Out-of-Core I/O Analysis Performance on HPC Systems. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, 1209-1215. Online publication date: 12-Nov-2023.
https://dl.acm.org/doi/10.1145/3624062.3624191

Boito F, Brandt J, Cardellini V, Carns P, Ciorba F, Egan H, Eleliemy A, Gentile A, Gruber T, Hanson J, Haus U, Huck K, Ilsche T, Jakobsche T, Jones T, Karlsson S, Mueen A, Ott M, Patki T, Peng I, Raghavan K, Simms S, Shoga K, Showerman M, Tiwari D, Wilde T, Yamamoto K. (2023). Autonomy Loops for Monitoring, Operational Data Analytics, Feedback, and Response in HPC Operations. 2023 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops), 37-43. Online publication date: 31-Oct-2023.
https://doi.org/10.1109/CLUSTERWorkshops61457.2023.00016

Liu Z, Zhang C, Wu H, Fang J, Peng L, Ye G, Tang Z. (2023). Optimizing HPC I/O Performance with Regression Analysis and Ensemble Learning. 2023 IEEE International Conference on Cluster Computing (CLUSTER), 234-246. Online publication date: 31-Oct-2023.
https://doi.org/10.1109/CLUSTER52292.2023.00027

Ather H, Bez J, Norris B, Byna S. (2023). Illuminating the I/O Optimization Path of Scientific Applications. High Performance Computing, 22-41. Online publication date: 10-May-2023.
https://doi.org/10.1007/978-3-031-32041-5_2

Nicolas L, Thomas L, Hadjadj-Aoul Y, Boukhobza J, Kuhn M, Duwe K, Acquaviva J, Chasapis K, Boukhobza J. (2022). SLRL. Proceedings of the Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems, 33-39. Online publication date: 5-Apr-2022.
https://dl.acm.org/doi/10.1145/3503646.3524297

Isakov M, Currier M, del Rosario E, Madireddy S, Balaprakash P, Carns P, Ross R, Lockwood G, Kinsy M. (2022). A Taxonomy of Error Sources in HPC I/O Machine Learning Models. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-14. Online publication date: Nov-2022.
https://doi.org/10.1109/SC41404.2022.00021

Bez J, Ather H, Byna S. (2022). Drishti: Guiding End-Users in the I/O Optimization Journey. 2022 IEEE/ACM International Parallel Data Systems Workshop (PDSW), 1-6. Online publication date: Nov-2022.
https://doi.org/10.1109/PDSW56643.2022.00006
