DOI: 10.1145/3503221.3508411
Research article | Open access

Vapro: performance variance detection and diagnosis for production-run parallel applications

Published: 28 March 2022

Abstract

Performance variance is a serious problem for parallel applications: it degrades performance and makes application behavior hard to understand. Detecting and diagnosing performance variance is therefore crucial for users and application developers. However, previous detection approaches either incur overhead large enough to hurt application performance, or rely on nontrivial source code analysis that is impractical for production-run parallel applications.
In this work, we propose Vapro, a performance variance detection and diagnosis framework for production-run parallel applications. Our approach is based on the observation that most parallel applications contain code snippets that are executed repeatedly with a fixed workload, and that these snippets can serve as probes for performance variance detection. To identify such snippets efficiently at runtime, even without program source code, we introduce the State Transition Graph (STG) to track program execution, and then perform lightweight workload analysis on the STG to locate variance. To diagnose the detected variance, Vapro applies a progressive diagnosis method built on a hybrid model that combines variance breakdown with statistical analysis. Results show that the performance overhead of Vapro is only 1.38% on average. Vapro detects variance in real applications caused by hardware bugs, memory, and I/O. After fixing the detected variance, the standard deviation of execution time is reduced by up to 73.5%. Compared with the state-of-the-art variance detection tool based on source code analysis, Vapro achieves 30.0% higher detection coverage.
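To make the abstract's core idea concrete, the sketch below is a minimal, hypothetical illustration (not Vapro's actual implementation): it assumes a `VarianceDetector` class that keys each timed execution interval by an STG transition and a workload signature, so that repeated runs of the same fixed-workload snippet become comparable, and flags a duration far above the group's median as variance. The transition labels, signature, sample threshold, and 2x outlier ratio are all illustrative assumptions.

```python
# Minimal sketch (hypothetical, not Vapro's code) of fixed-workload
# variance detection: executions of the same snippet with the same
# workload should take similar time, so large deviations signal variance.
from collections import defaultdict
from statistics import median


class VarianceDetector:
    def __init__(self, ratio=2.0, min_samples=5):
        # (STG transition, workload signature) -> observed durations.
        # The transition stands in for an STG edge (e.g., between two
        # call-site states); the signature identifies a fixed workload.
        self.history = defaultdict(list)
        self.ratio = ratio
        self.min_samples = min_samples

    def record(self, transition, workload_sig, duration):
        """Record one execution interval; return True if it is an outlier."""
        samples = self.history[(transition, workload_sig)]
        samples.append(duration)
        if len(samples) < self.min_samples:
            return False  # not enough evidence yet
        # Compare against the median cost of the same fixed workload.
        return duration > self.ratio * median(samples)


if __name__ == "__main__":
    det = VarianceDetector()
    # A snippet (state A -> B) run repeatedly with an identical workload...
    for t in (1.00, 1.02, 0.99, 1.01, 1.00):
        det.record(("A", "B"), workload_sig=4096, duration=t)
    # ...then one slow run, e.g. under interference, is flagged.
    print(det.record(("A", "B"), workload_sig=4096, duration=3.2))  # True
```

Vapro's diagnosis goes further than this sketch, progressively breaking detected variance down and applying statistical analysis; the sketch covers only the detection side.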


    Published In

    PPoPP '22: Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
    April 2022
    495 pages
    ISBN: 9781450392044
    DOI: 10.1145/3503221
    This work is licensed under a Creative Commons Attribution-NonCommercial International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. anomaly detection
    2. performance variance
    3. system noise

    Funding Sources

    • ShuiMu Tsinghua Scholar fellowship
    • Tsinghua University Initiative Scientific Research Program
    • National Key R&D Program of China
    • National Natural Science Foundation of China
    • University of Sydney faculty startup funding
    • SOAR fellowship
    • Australia Research Council (ARC) Discovery Project
    • Beijing Natural Science Foundation
    • China Postdoctoral Science Foundation

    Conference

    PPoPP '22

    Acceptance Rates

    Overall acceptance rate: 230 of 1,014 submissions (23%)

    Article Metrics

    • Downloads (last 12 months): 300
    • Downloads (last 6 weeks): 44

    Cited By

    • (2024) TinyProf: Towards Continuous Performance Introspection through Scalable Parallel I/O. ISC High Performance 2024 Research Paper Proceedings (39th International Conference), 1-12. DOI: 10.23919/ISC.2024.10528932. Online publication date: May 2024.
    • (2024) Cloud-Native Computing: A Survey From the Perspective of Services. Proceedings of the IEEE, 112(1), 12-46. DOI: 10.1109/JPROC.2024.3353855. Online publication date: January 2024.
    • (2024) Analysis and prediction of performance variability in large-scale computing systems. The Journal of Supercomputing, 80(10), 14978-15005. DOI: 10.1007/s11227-024-06040-w. Online publication date: 28 March 2024.
    • (2023) Production-Run Noise Detection. In Performance Analysis of Parallel Applications for HPC, 199-224. DOI: 10.1007/978-981-99-4366-1_8. Online publication date: 19 June 2023.
    • (2022) Detecting Performance Variance for Parallel Applications Without Source Code. IEEE Transactions on Parallel and Distributed Systems, 33(12), 4239-4255. DOI: 10.1109/TPDS.2022.3181799. Online publication date: 1 December 2022.
