DOI: 10.1145/3524059.3532388
Research article · Public Access

Low overhead and context sensitive profiling of GPU-accelerated applications

Published: 28 June 2022

Abstract

As we near the end of Moore's law scaling, next-generation computing platforms are increasingly exploring heterogeneous processors for acceleration, among which Graphics Processing Units (GPUs) are the most widely used. Meanwhile, applications are evolving, adopting new programming models and algorithms for these emerging platforms. To harness the full power of GPUs, performance tools play a critical role in understanding and tuning application performance, especially for applications whose complex executions span both CPU and GPU. To help developers analyze and tune applications, performance tools need to associate performance metrics with calling contexts. However, existing performance tools incur high overhead when collecting and attributing performance metrics to full calling contexts. To address this problem, we developed a tool that constructs both CPU and GPU calling contexts with low overhead and high accuracy. With an innovative call path memoization mechanism, our tool obtains call paths for GPU operations at negligible cost. For GPU calling contexts, our tool uses an adaptive epoch profiling method that collects GPU instruction samples with reduced synchronization cost and reconstructs the calling contexts through postmortem analysis. We evaluated our tool on nine HPC and machine learning applications on a machine equipped with an NVIDIA GPU. Compared with state-of-the-art GPU profilers, our tool reduces the overhead of coarse-grained profiling of GPU operations from 2.07X to 1.42X and the overhead of fine-grained profiling of GPU instructions from 27.51X to 4.61X, with accuracies of 99.93% and 96.16%, respectively.
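The call path memoization idea from the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and the cache key (the innermost return address plus stack depth) are assumptions made for the sketch. The point is that repeated GPU launches from the same call site can reuse a previously unwound call path instead of paying for a full stack unwind each time.

```python
# Illustrative sketch of call-path memoization for GPU API calls.
# unwind_full_stack() stands in for an expensive full stack unwind;
# the cache key below is a simplification chosen for this example.

_call_path_cache = {}

def unwind_full_stack(frames):
    """Placeholder for a full (expensive) unwind of the CPU stack."""
    return tuple(frames)

def get_call_path(frames):
    """Return the calling context for a GPU operation.

    Paths are keyed by the innermost frame and the stack depth, so
    repeated launches from the same call site hit the cache and skip
    the full unwind.
    """
    key = (frames[-1], len(frames))
    path = _call_path_cache.get(key)
    if path is None:
        path = unwind_full_stack(frames)   # only on a cache miss
        _call_path_cache[key] = path
    return path

# Two launches from the same call site share a single unwind.
p1 = get_call_path(["main", "solver", "launch_kernel"])
p2 = get_call_path(["main", "solver", "launch_kernel"])
assert p1 is p2
```

In a real profiler the memoized entries would be keyed by markers placed on the actual call stack rather than a Python list, but the caching structure is the same.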


Cited By

  • Performance Implications of Async Memcpy and UVM: A Tale of Two Data Transfer Modes. 2023 IEEE International Symposium on Workload Characterization (IISWC), 115-127. DOI: 10.1109/IISWC59245.2023.00024. Online publication date: 1-Oct-2023.

Published In

ICS '22: Proceedings of the 36th ACM International Conference on Supercomputing
June 2022
514 pages
ISBN:9781450392815
DOI:10.1145/3524059

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPU performance tools
  2. GPU-accelerated applications
  3. GPUs
  4. calling context
  5. instruction sampling
  6. profiling

Conference

ICS '22
Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

