DOI: 10.5555/3433701.3433738
Research article (SC '20 Conference Proceedings)

ScalAna: automating scaling loss detection with graph analysis

Published: 09 November 2020

Abstract

Scaling a parallel program to modern supercomputers is challenging due to inter-process communication, Amdahl's law, and resource contention. Performance analysis tools for finding such scaling bottlenecks are based on either profiling or tracing. Profiling incurs low overhead but does not capture the detailed dependencies needed for root-cause analysis; tracing records all of this information, but at prohibitive overhead.
In this work, we design ScalAna, which uses static analysis techniques to achieve the best of both worlds: it enables the analyzability of traces at a cost similar to profiling. ScalAna first leverages static compiler techniques to build a Program Structure Graph, which records the main computation and communication patterns as well as the program's control structures. At runtime, we use lightweight techniques to collect performance data according to the graph structure and generate a Program Performance Graph. With this graph, we propose a novel approach, called backtracking root cause detection, which automatically and efficiently detects the root causes of scaling loss. We evaluate ScalAna with real applications. The results show that our approach effectively locates the root causes of scaling loss for real applications and incurs only 1.73% overhead on average for up to 2,048 processes. By fixing the root causes detected by ScalAna, we achieve up to 11.11% performance improvement on 2,048 processes.
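The backtracking idea in the abstract can be illustrated with a small sketch: starting from a graph vertex that exhibits scaling loss (a symptom, such as growing wait time in a collective), walk dependency edges backwards until reaching vertices with no poorly scaling ancestors. Note this is a hypothetical illustration, not ScalAna's implementation: the `PPGNode` structure, the `scaling_loss` ratio heuristic, and the process counts are all assumptions made for the example.

```python
# Hypothetical sketch of backtracking root-cause detection on a
# Program Performance Graph (PPG). Node fields and the scaling-loss
# heuristic are illustrative assumptions, not ScalAna's actual design.

from dataclasses import dataclass, field

@dataclass
class PPGNode:
    name: str
    times: dict                               # process count -> measured time
    deps: list = field(default_factory=list)  # vertices this one waits on

def scaling_loss(node, small=64, large=2048):
    # Under ideal strong scaling, time shrinks as process count grows;
    # a ratio > 1 means this vertex's time grew instead.
    return node.times[large] / node.times[small]

def backtrack_root_cause(node, threshold=1.0, visited=None):
    """Walk dependency edges backwards from a symptom vertex and return
    the deepest ancestors that themselves lose scaling (candidate root
    causes). A vertex with no poorly scaling dependency is a root."""
    if visited is None:
        visited = set()
    visited.add(node.name)
    bad_deps = [d for d in node.deps
                if d.name not in visited and scaling_loss(d) > threshold]
    if not bad_deps:
        return [node]            # no upstream culprit: report this vertex
    roots = []
    for d in bad_deps:
        roots.extend(backtrack_root_cause(d, threshold, visited))
    return roots

# Example: growing time in MPI_Allreduce is only a symptom; the loss
# originates in an upstream, load-imbalanced compute loop.
compute = PPGNode("compute_loop", {64: 1.0, 2048: 1.4})
allreduce = PPGNode("MPI_Allreduce", {64: 0.2, 2048: 0.9}, deps=[compute])
print([n.name for n in backtrack_root_cause(allreduce)])  # -> ['compute_loop']
```

The point of the backtracking pass is exactly this separation of symptom from cause: a blocking collective often shows the largest wait time, but the fix belongs at the imbalanced producer upstream of it.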

Published In

SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2020, 1454 pages. ISBN: 9781728199986.

In-Cooperation: IEEE CS
Publisher: IEEE Press

Author Tags

1. performance analysis
2. root-cause detection
3. scalability bottleneck
4. static analysis

Conference

SC '20. Overall acceptance rate: 1,516 of 6,373 submissions, 24%.
