DOI: 10.1145/3588195.3592986
research-article
Open access

AIIO: Using Artificial Intelligence for Job-Level and Automatic I/O Performance Bottleneck Diagnosis

Published: 07 August 2023

Abstract

Manually diagnosing the I/O performance bottleneck of a single application (hereinafter referred to as the "job level") is a tedious and error-prone procedure that requires domain scientists to have deep knowledge of complex storage systems. However, existing automatic methods for I/O performance bottleneck diagnosis have one major issue: they analyze at the platform or group level, so their diagnosis results cannot be applied to an individual application. To address this issue, we designed and developed a method named "Artificial Intelligence for I/O" (AIIO), which uses AI and its interpretation technology to diagnose I/O performance bottlenecks at the job level automatically. By considering the sparsity of I/O log files, employing multiple AI models for performance prediction, merging diagnosis results across multiple models, and generalizing its performance prediction and diagnosis functions, AIIO can accurately and robustly identify the bottleneck of even an unseen application. Experimental results show that real and unseen applications can use the diagnosis results from AIIO to improve their I/O performance by up to 146 times.


      Published In

      HPDC '23: Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing
      August 2023
      350 pages
      ISBN:9798400701559
      DOI:10.1145/3588195
      • General Chair: Ali R. Butt
      • Program Chairs: Ningfang Mi, Kyle Chard
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. AI interpretation
      2. Darshan
      3. I/O bottleneck
      4. artificial intelligence
      5. diagnosis
      6. job-level
      7. machine learning
      8. prediction

      Qualifiers

      • Research-article

      Funding Sources

      • the U.S. Department of Energy (DOE)
      • Exascale Computing Project

      Conference

      HPDC '23

      Acceptance Rates

      Overall Acceptance Rate 166 of 966 submissions, 17%
