DOI: 10.1145/3524059.3532388
Research article · Public Access

Low overhead and context sensitive profiling of GPU-accelerated applications

Published: 28 June 2022

Abstract

As we near the end of Moore's law scaling, next-generation computing platforms are increasingly exploring heterogeneous processors for acceleration, among which Graphics Processing Units (GPUs) are the most widely used. Meanwhile, applications are evolving, adopting new programming models and algorithms for these emerging platforms. To harness the full power of GPUs, performance tools play a critical role in understanding and tuning application performance, especially for applications whose complex executions span both CPU and GPU. To help developers analyze and tune applications, performance tools need to associate performance metrics with calling contexts. However, existing performance tools incur high overhead when collecting and attributing performance metrics to full calling contexts. To address this problem, we developed a tool that constructs both CPU and GPU calling contexts with low overhead and high accuracy. With an innovative call path memoization mechanism, our tool obtains call paths for GPU operations at negligible cost. For GPU calling contexts, our tool uses an adaptive epoch profiling method that collects GPU instruction samples with reduced synchronization cost and reconstructs the calling contexts through postmortem analysis. We evaluated our tool on nine HPC and machine learning applications on a machine equipped with an NVIDIA GPU. Compared with state-of-the-art GPU profilers, our tool reduces the overhead of coarse-grained profiling of GPU operations from 2.07X to 1.42X and the overhead of fine-grained profiling of GPU instructions from 27.51X to 4.61X, with accuracies of 99.93% and 96.16%, respectively.
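The call path memoization idea from the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and the cache key (the innermost return address plus stack depth) are assumptions made for the sketch. The point is that repeated GPU launches from the same call site can reuse a previously unwound call path instead of paying for a full stack unwind each time.

```python
# Illustrative sketch of call-path memoization for GPU API calls.
# unwind_full_stack() stands in for an expensive full stack unwind;
# the cache key below is a simplification chosen for this example.

_call_path_cache = {}

def unwind_full_stack(frames):
    """Placeholder for a full (expensive) unwind of the CPU stack."""
    return tuple(frames)

def get_call_path(frames):
    """Return the calling context for a GPU operation.

    Paths are keyed by the innermost frame and the stack depth, so
    repeated launches from the same call site hit the cache and skip
    the full unwind.
    """
    key = (frames[-1], len(frames))
    path = _call_path_cache.get(key)
    if path is None:
        path = unwind_full_stack(frames)   # only on a cache miss
        _call_path_cache[key] = path
    return path

# Two launches from the same call site share a single unwind.
p1 = get_call_path(["main", "solver", "launch_kernel"])
p2 = get_call_path(["main", "solver", "launch_kernel"])
assert p1 is p2
```

In a real profiler the memoized entries would be keyed by markers placed on the actual call stack rather than a Python list, but the caching structure is the same.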


Cited By

  • Performance Implications of Async Memcpy and UVM: A Tale of Two Data Transfer Modes. 2023 IEEE International Symposium on Workload Characterization (IISWC), 115-127. DOI: 10.1109/IISWC59245.2023.00024. Online publication date: 1-Oct-2023.

Published In

ICS '22: Proceedings of the 36th ACM International Conference on Supercomputing
June 2022
514 pages
ISBN:9781450392815
DOI:10.1145/3524059

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPU performance tools
  2. GPU-accelerated applications
  3. GPUs
  4. calling context
  5. instruction sampling
  6. profiling

Conference

ICS '22
Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

