DOI: 10.1145/3503221.3508405 · Open access
PerFlow: a domain specific framework for automatic performance analysis of parallel applications

Published: 28 March 2022

Abstract

    Performance analysis is widely used to identify performance issues in parallel applications. However, complex communication and data dependences, as well as interactions between different kinds of performance issues, make efficient performance analysis even harder. Although a large number of performance tools have been designed, accurately pinpointing the root causes of such complex performance issues still requires specific in-depth analysis, and implementing each such analysis normally demands significant human effort and domain knowledge.
    To reduce the burden of implementing accurate performance analysis, we propose a domain-specific programming framework named PerFlow. PerFlow abstracts the step-by-step process of performance analysis as a dataflow graph. This dataflow graph consists of the main performance-analysis sub-tasks, called passes, which can either be provided by PerFlow's built-in analysis library or be implemented by developers to meet their own requirements. Moreover, to achieve effective analysis, we propose a Program Abstraction Graph to represent the performance of a program execution, and we leverage various graph algorithms on it to automate the analysis. We demonstrate the efficacy of PerFlow through three case studies of real-world applications with up to 700K lines of code. Results show that PerFlow significantly eases the implementation of customized analysis tasks and is able to locate performance bugs automatically and effectively.
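    The pass-and-dataflow design described in the abstract can be sketched as follows. PerFlow's actual API is not shown on this page, so every name below (`PAG`, `hotspot_pass`, `leaf_pass`, `run_dataflow`) is a hypothetical illustration of the idea, not the framework's real interface: performance data is attached to a graph abstraction of the program, and small analysis passes are composed into a dataflow that progressively narrows the candidate set of problem code regions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class PAG:
    """Toy stand-in for a Program Abstraction Graph: vertices are code
    regions carrying performance data; edges are call relations."""
    time: Dict[str, float] = field(default_factory=dict)        # vertex -> exec time
    edges: List[Tuple[str, str]] = field(default_factory=list)  # (caller, callee)

# A pass maps (graph, candidate vertices) -> filtered candidate vertices.
Pass = Callable[[PAG, List[str]], List[str]]

def hotspot_pass(pag: PAG, vertices: List[str]) -> List[str]:
    """Keep vertices whose time exceeds 10% of total (illustrative threshold)."""
    total = sum(pag.time.values())
    return [v for v in vertices if pag.time[v] > 0.1 * total]

def leaf_pass(pag: PAG, vertices: List[str]) -> List[str]:
    """Keep vertices with no outgoing calls: likely where time is spent."""
    callers = {u for (u, _) in pag.edges}
    return [v for v in vertices if v not in callers]

def run_dataflow(pag: PAG, passes: List[Pass]) -> List[str]:
    """Feed the full vertex set through each pass in turn (a linear dataflow)."""
    result = list(pag.time)
    for p in passes:
        result = p(pag, result)
    return result

pag = PAG(
    time={"main": 1.0, "compute": 8.0, "mpi_wait": 0.5},
    edges=[("main", "compute"), ("main", "mpi_wait")],
)
print(run_dataflow(pag, [hotspot_pass, leaf_pass]))  # -> ['compute']
```

    In this toy dataflow, the hotspot pass discards `mpi_wait` as too cheap, and the leaf pass discards `main` because it is only a caller, leaving `compute` as the suggested region to inspect. A real analysis would chain richer passes (e.g. imbalance or dependence analysis) over a much larger attributed graph.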


    Cited By

    • (2023) VClinic: A Portable and Efficient Framework for Fine-Grained Value Profilers. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 892-904. DOI: 10.1145/3575693.3576934. Online publication date: 27-Jan-2023.
    • (2023) Domain-Specific Framework for Performance Analysis. In Performance Analysis of Parallel Applications for HPC, 227-254. DOI: 10.1007/978-981-99-4366-1_9. Online publication date: 19-Jun-2023.
    • (2022) Detecting Performance Variance for Parallel Applications Without Source Code. IEEE Transactions on Parallel and Distributed Systems 33(12), 4239-4255. DOI: 10.1109/TPDS.2022.3181799. Online publication date: 1-Dec-2022.

    Published In

    PPoPP '22: Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
    April 2022
    495 pages
    ISBN:9781450392044
    DOI:10.1145/3503221
    This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. dataflow graph
    2. domain specific framework
    3. performance analysis

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Beijing Natural Science Foundation
    • National Key R&D Program of China

    Conference

    PPoPP '22

    Acceptance Rates

    Overall Acceptance Rate 230 of 1,014 submissions, 23%

    Article Metrics

    • Downloads (Last 12 months)367
    • Downloads (Last 6 weeks)46
