DOI: 10.1145/3168831
CGO Conference Proceedings · Research Article · Public Access

CUDAAdvisor: LLVM-based runtime profiling for modern GPUs

Published: 24 February 2018

Abstract

General-purpose GPUs have been widely used to accelerate parallel applications. Given a relatively complex programming model and rapid architecture evolution, producing efficient GPU code is nontrivial. A variety of simulation and profiling tools have been developed to aid GPU application optimization and architecture design, but existing tools either provide insufficient insight or lack support across different GPU architectures, runtime versions, and driver versions. This paper presents CUDAAdvisor, a profiling framework that guides code optimization on modern NVIDIA GPUs. CUDAAdvisor performs various fine-grained analyses based on the profiling results from GPU kernels, including memory-level analysis (e.g., reuse distance and memory divergence), control-flow analysis (e.g., branch divergence), and code- and data-centric debugging. Unlike prior tools, CUDAAdvisor supports GPU profiling across different CUDA versions and architectures, including CUDA 8.0 and the Pascal architecture. We present several case studies in which CUDAAdvisor derives significant insights that guide GPU code optimization for performance improvement.
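The memory-level analysis named above centers on reuse distance: for each access, the number of *distinct* addresses touched since the previous access to the same address. Under an LRU cache, a reuse distance smaller than the cache capacity (in lines) predicts a hit. As a rough illustration of the metric only (not CUDAAdvisor's actual LLVM-based implementation; the function name and toy trace are hypothetical), a minimal stack-distance sketch:

```python
from collections import OrderedDict

def reuse_distances(trace):
    """LRU reuse (stack) distance for each access in an address trace.

    An access's reuse distance is the number of distinct addresses
    touched since the previous access to the same address, or None
    for a first-time (cold) access.
    """
    lru = OrderedDict()   # addresses ordered least- to most-recently used
    distances = []
    for addr in trace:
        if addr in lru:
            # Distance = number of distinct addresses used more recently
            # than the previous access to addr.
            keys = list(lru)
            distances.append(len(keys) - 1 - keys.index(addr))
            lru.move_to_end(addr)   # addr is now most recently used
        else:
            distances.append(None)  # cold access
            lru[addr] = True
    return distances

print(reuse_distances(["a", "b", "c", "a", "b", "b"]))
# → [None, None, None, 2, 2, 0]
```

This naive version scans the LRU list on every access (O(n) per access); production profilers typically use a balanced-tree or interval-tree structure to answer the same query in O(log n).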




Published In

CGO '18: Proceedings of the 2018 International Symposium on Code Generation and Optimization
February 2018, 377 pages
ISBN: 9781450356176
DOI: 10.1145/3179541

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor, or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication Notes

Badge change: this article was originally badged under Version 1.0 of the ACM artifact-review badging guidelines (https://www.acm.org/publications/policies/artifact-review-badging).


        Author Tags

        1. GPU
        2. LLVM
        3. Optimization
        4. Profiling

Acceptance Rates

Overall acceptance rate: 312 of 1,061 submissions, 29%
