DOI: 10.1145/3168831
CGO Conference Proceedings · Research Article · Public Access

CUDAAdvisor: LLVM-based runtime profiling for modern GPUs

Published: 24 February 2018

Abstract

General-purpose GPUs have been widely used to accelerate parallel applications. Given a relatively complex programming model and rapid architecture evolution, producing efficient GPU code is nontrivial. A variety of simulation and profiling tools have been developed to aid GPU application optimization and architecture design, but existing tools either provide insufficient insight or lack support across different GPU architectures, runtime versions, and driver versions. This paper presents CUDAAdvisor, a profiling framework that guides code optimization on modern NVIDIA GPUs. CUDAAdvisor performs various fine-grained analyses based on the profiling results from GPU kernels, including memory-level analysis (e.g., reuse distance and memory divergence), control-flow analysis (e.g., branch divergence), and code- and data-centric debugging. Unlike prior tools, CUDAAdvisor supports GPU profiling across different CUDA versions and architectures, including CUDA 8.0 and the Pascal architecture. We present several case studies in which CUDAAdvisor derives significant insights that guide GPU code optimization for performance improvement.
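The memory-level analysis named above centers on reuse distance: for each access, the number of *distinct* addresses touched since the previous access to the same address. Under an LRU cache, a reuse distance smaller than the cache capacity (in lines) predicts a hit. As a rough illustration of the metric only (not CUDAAdvisor's actual LLVM-based implementation; the function name and toy trace are hypothetical), a minimal stack-distance sketch:

```python
from collections import OrderedDict

def reuse_distances(trace):
    """LRU reuse (stack) distance for each access in an address trace.

    An access's reuse distance is the number of distinct addresses
    touched since the previous access to the same address, or None
    for a first-time (cold) access.
    """
    lru = OrderedDict()   # addresses ordered least- to most-recently used
    distances = []
    for addr in trace:
        if addr in lru:
            # Distance = number of distinct addresses used more recently
            # than the previous access to addr.
            keys = list(lru)
            distances.append(len(keys) - 1 - keys.index(addr))
            lru.move_to_end(addr)   # addr is now most recently used
        else:
            distances.append(None)  # cold access
            lru[addr] = True
    return distances

print(reuse_distances(["a", "b", "c", "a", "b", "b"]))
# → [None, None, None, 2, 2, 0]
```

This naive version scans the LRU list on every access (O(n) per access); production profilers typically use a balanced-tree or interval-tree structure to answer the same query in O(log n).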




Published In

CGO '18: Proceedings of the 2018 International Symposium on Code Generation and Optimization
February 2018, 377 pages
ISBN: 9781450356176
DOI: 10.1145/3179541

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor, or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication Notes

Badge change: this article was originally badged under Version 1.0 of the ACM artifact-review badging guidelines (https://www.acm.org/publications/policies/artifact-review-badging).


        Author Tags

        1. GPU
        2. LLVM
        3. Optimization
        4. Profiling

Acceptance Rates

Overall acceptance rate: 312 of 1,061 submissions, 29%
