Open access

AIIO: Using Artificial Intelligence for Job-Level and Automatic I/O Performance Bottleneck Diagnosis

Authors:

Suren Byna

HPDC '23: Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing

Pages 155 - 167

https://doi.org/10.1145/3588195.3592986

Published: 07 August 2023

Abstract

Manually diagnosing the I/O performance bottleneck of a single application (hereinafter, the "job level") is a tedious and error-prone procedure that requires domain scientists to have deep knowledge of complex storage systems. Existing automatic methods for I/O performance bottleneck diagnosis, however, share one major limitation: they analyze at the platform or group level, so their diagnosis results cannot be applied to an individual application. To address this issue, we designed and developed a method named "Artificial Intelligence for I/O" (AIIO), which uses AI and its interpretation technology to diagnose I/O performance bottlenecks at the job level automatically. By accounting for the sparsity of I/O log files, employing multiple AI models for performance prediction, merging diagnosis results across multiple models, and generalizing its performance prediction and diagnosis functions, AIIO can accurately and robustly identify the bottleneck of even an unseen application. Experimental results show that real and unseen applications can use the diagnosis results from AIIO to improve their I/O performance by up to 146 times.
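
One step named in the abstract, merging diagnosis results across multiple models, can be illustrated with a minimal sketch. The abstract does not say how AIIO performs the merge, so everything below is an assumption made for illustration: the inverse-error weighting, the function names `merge_diagnoses` and `top_bottlenecks`, and the feature names are all hypothetical. Each model is assumed to supply a per-feature contribution to its predicted I/O performance for one job, where a negative value means the feature lowers the prediction.

```python
from typing import Dict, List, Tuple


def merge_diagnoses(
    attributions: List[Dict[str, float]],
    errors: List[float],
) -> Dict[str, float]:
    """Combine per-model feature contributions for one job.

    Assumption (not from the paper): each model is weighted by the
    inverse of its prediction error on this job, so models that
    predicted the job's performance more closely count for more.
    """
    eps = 1e-9  # avoids division by zero for a perfect prediction
    raw = [1.0 / (e + eps) for e in errors]
    total = sum(raw)
    weights = [r / total for r in raw]

    merged: Dict[str, float] = {}
    for weight, attribution in zip(weights, attributions):
        for feature, value in attribution.items():
            merged[feature] = merged.get(feature, 0.0) + weight * value
    return merged


def top_bottlenecks(merged: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return up to k features with the most negative merged contribution."""
    negative = [(f, v) for f, v in merged.items() if v < 0]
    return sorted(negative, key=lambda item: item[1])[:k]


# Hypothetical contributions from two models for the same job.
model_a = {"small_writes": -0.6, "seek_count": -0.2, "sequential_writes": 0.3}
model_b = {"small_writes": -0.4, "seek_count": 0.1, "sequential_writes": 0.2}

merged = merge_diagnoses([model_a, model_b], errors=[0.1, 0.3])
for feature, value in top_bottlenecks(merged, k=2):
    print(f"{feature}: {value:+.3f}")
```

In this sketch the two models disagree on the sign of `seek_count`; because the first model has the lower error, its negative contribution dominates and the feature is still reported as a candidate bottleneck, though with a much smaller magnitude than `small_writes`.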

Index Terms

AIIO: Using Artificial Intelligence for Job-Level and Automatic I/O Performance Bottleneck Diagnosis
1. Computing methodologies
  1. Artificial intelligence
    1. Knowledge representation and reasoning
      1. Causal reasoning and diagnostics
2. Information systems
  1. Information storage systems

Published In

HPDC '23: Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing

August 2023

350 pages

ISBN: 9798400701559

DOI: 10.1145/3588195

General Chair: Ali R. Butt (Virginia Tech, USA)

Program Chairs: Ningfang Mi (Northeastern University, USA), Kyle Chard (University of Chicago & Argonne National Laboratory, USA)

Copyright © 2023 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

the U.S. Department of Energy (DOE)
Exascale Computing Project

Conference

HPDC '23

HPDC '23: The 32nd International Symposium on High-Performance Parallel and Distributed Computing

June 16 - 23, 2023

Orlando, FL, USA

Acceptance Rates

Overall Acceptance Rate 166 of 966 submissions, 17%
