DOI: 10.1145/3503221.3508405 · Open access
PerFlow: a domain specific framework for automatic performance analysis of parallel applications

Published: 28 March 2022

Abstract

    Performance analysis is widely used to identify performance issues in parallel applications. However, complex communication and data dependences, as well as interactions between different kinds of performance issues, make efficient performance analysis even harder. Although a large number of performance tools have been designed, accurately pinpointing the root causes of such complex performance issues still requires specific in-depth analysis, and implementing each such analysis normally demands significant human effort and domain knowledge.
    To reduce the burden of implementing accurate performance analysis, we propose a domain-specific programming framework named PerFlow. PerFlow abstracts the step-by-step process of performance analysis as a dataflow graph. This dataflow graph consists of the main performance-analysis sub-tasks, called passes, which can either be provided by PerFlow's built-in analysis library or be implemented by developers to meet their own requirements. Moreover, to achieve effective analysis, we propose a Program Abstraction Graph to represent the performance of a program execution, and we leverage various graph algorithms on it to automate the analysis. We demonstrate the efficacy of PerFlow through three case studies of real-world applications with up to 700K lines of code. Results show that PerFlow significantly eases the implementation of customized analysis tasks and is able to locate performance bugs automatically and effectively.
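    The pass-and-dataflow design described in the abstract can be sketched as follows. PerFlow's actual API is not shown on this page, so every name below (`PAG`, `hotspot_pass`, `leaf_pass`, `run_dataflow`) is a hypothetical illustration of the idea, not the framework's real interface: performance data is attached to a graph abstraction of the program, and small analysis passes are composed into a dataflow that progressively narrows the candidate set of problem code regions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class PAG:
    """Toy stand-in for a Program Abstraction Graph: vertices are code
    regions carrying performance data; edges are call relations."""
    time: Dict[str, float] = field(default_factory=dict)        # vertex -> exec time
    edges: List[Tuple[str, str]] = field(default_factory=list)  # (caller, callee)

# A pass maps (graph, candidate vertices) -> filtered candidate vertices.
Pass = Callable[[PAG, List[str]], List[str]]

def hotspot_pass(pag: PAG, vertices: List[str]) -> List[str]:
    """Keep vertices whose time exceeds 10% of total (illustrative threshold)."""
    total = sum(pag.time.values())
    return [v for v in vertices if pag.time[v] > 0.1 * total]

def leaf_pass(pag: PAG, vertices: List[str]) -> List[str]:
    """Keep vertices with no outgoing calls: likely where time is spent."""
    callers = {u for (u, _) in pag.edges}
    return [v for v in vertices if v not in callers]

def run_dataflow(pag: PAG, passes: List[Pass]) -> List[str]:
    """Feed the full vertex set through each pass in turn (a linear dataflow)."""
    result = list(pag.time)
    for p in passes:
        result = p(pag, result)
    return result

pag = PAG(
    time={"main": 1.0, "compute": 8.0, "mpi_wait": 0.5},
    edges=[("main", "compute"), ("main", "mpi_wait")],
)
print(run_dataflow(pag, [hotspot_pass, leaf_pass]))  # -> ['compute']
```

    In this toy dataflow, the hotspot pass discards `mpi_wait` as too cheap, and the leaf pass discards `main` because it is only a caller, leaving `compute` as the suggested region to inspect. A real analysis would chain richer passes (e.g. imbalance or dependence analysis) over a much larger attributed graph.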


    Cited By

    • (2023) VClinic: A Portable and Efficient Framework for Fine-Grained Value Profilers. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 892-904. DOI: 10.1145/3575693.3576934. Online publication date: 27-Jan-2023.
    • (2023) Domain-Specific Framework for Performance Analysis. In Performance Analysis of Parallel Applications for HPC, 227-254. DOI: 10.1007/978-981-99-4366-1_9. Online publication date: 19-Jun-2023.
    • (2022) Detecting Performance Variance for Parallel Applications Without Source Code. IEEE Transactions on Parallel and Distributed Systems 33(12), 4239-4255. DOI: 10.1109/TPDS.2022.3181799. Online publication date: 1-Dec-2022.

    Published In

    PPoPP '22: Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
    April 2022
    495 pages
    ISBN:9781450392044
    DOI:10.1145/3503221
    This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. dataflow graph
    2. domain specific framework
    3. performance analysis

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Beijing Natural Science Foundation
    • National Key R&D Program of China

    Conference

    PPoPP '22

    Acceptance Rates

    Overall Acceptance Rate 230 of 1,014 submissions, 23%

    Article Metrics

    • Downloads (Last 12 months)367
    • Downloads (Last 6 weeks)46
