DOI: 10.1145/2145816.2145819
Research article

A performance analysis framework for identifying potential benefits in GPGPU applications

Published: 25 February 2012

Abstract

Tuning code for GPGPU and other emerging many-core platforms is a challenge because few models or tools can precisely pinpoint the root cause of performance bottlenecks. In this paper, we present a performance analysis framework that can help shed light on such bottlenecks for GPGPU applications. Although a handful of GPGPU profiling tools exist, most traditional tools unfortunately provide programmers only with a variety of measurements and metrics obtained by running applications, and it is often difficult to map these metrics to the root causes of slowdowns, let alone decide which optimization step to take next to alleviate a bottleneck. In our approach, we first develop an analytical performance model that can precisely predict performance and aims to provide programmer-interpretable metrics. Then, we apply static and dynamic profiling to instantiate our performance model for a particular input code and show how the model can predict the potential performance benefits. We demonstrate our framework on a suite of micro-benchmarks as well as a variety of computations extracted from real codes.




Reviews

Amitabha Roy

General-purpose graphics processing units (GPGPUs) are becoming increasingly popular as a means to accelerate various scientific kernels, as evidenced by their adoption in the high-performance computing community and by the integration of GPU cores into mainstream central processing units (CPUs). However, GPU performance tuning has thus far been a niche area due to the lack of tools for determining which factors contribute to the performance of individual program components. This is in contrast to the CPU domain, where many mature tools do a good job of performance analysis. This paper is an excellent first step in that direction.

The authors present a performance analysis framework that can attribute a GPU kernel's execution time to its different contributing factors. For example, it can separate the time spent waiting for memory accesses from the time spent computing results. Readers interested in performance modeling will find this approach instructive and novel. The paper starts with the construction of a detailed analytical performance model of the GPU. The authors then combine statically determined metrics (such as instruction group sizes within a basic block) with dynamically determined metrics (such as instruction mix) to parameterize the model. They can then predict the effect of various optimizations, some based on algorithm changes and some that can be applied automatically (such as using the available shared memory on the GPU). They also model parallelism carefully: for example, the authors separate parallelism during memory access from parallelism during computation, taking into account both the characteristics of the application and the parameters of the underlying GPU. This framework would be useful to anyone interested in optimizing the execution of their GPU kernels.

The paper is accessible to most people interested in performance modeling, although the terminology is naturally GPU-centric in places, and the evaluation (as presented) is limited to one specific GPU.

Online Computing Reviews Service
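The overlap of compute time and memory-wait time that the review describes can be illustrated with a toy analytical cost model. This is only an illustrative sketch: the function, its parameters, and the formulas below are hypothetical stand-ins, not the paper's actual equations.

```python
# Toy analytical GPU kernel cost model, in the spirit of the framework the
# review describes. All names and formulas here are hypothetical.

def predict_kernel_cycles(comp_insts, mem_insts, mem_latency,
                          active_warps, mem_bandwidth_warps):
    """Estimate per-warp kernel cycles, separating compute time from
    exposed (non-hidden) memory-wait time."""
    comp_cycles = comp_insts                 # assume 1 cycle per compute instruction
    mem_cycles = mem_insts * mem_latency     # total memory latency per warp
    # Memory-level parallelism: concurrent warps hide memory latency,
    # up to the number of requests the memory system can service at once.
    mwp = min(active_warps, mem_bandwidth_warps)
    exposed_mem = mem_cycles / mwp
    # Compute overlaps with memory waiting; the total is dominated by the
    # larger of the two, plus a small non-overlapped remainder.
    return max(comp_cycles, exposed_mem) + min(comp_cycles, exposed_mem) / active_warps

# The framework's "potential benefit" idea: model an optimization (e.g. moving
# reused data into shared memory) as fewer global-memory instructions, then
# re-evaluate the model instead of rewriting and re-running the kernel.
baseline  = predict_kernel_cycles(comp_insts=200, mem_insts=40,
                                  mem_latency=400, active_warps=8,
                                  mem_bandwidth_warps=4)
optimized = predict_kernel_cycles(comp_insts=200, mem_insts=10,
                                  mem_latency=400, active_warps=8,
                                  mem_bandwidth_warps=4)
predicted_speedup = baseline / optimized
```

In this toy setting the baseline kernel is memory-bound (exposed memory time far exceeds compute time), so the model predicts a large benefit from reducing global-memory traffic; a compute-bound kernel would show almost none, which is exactly the kind of root-cause distinction the paper's framework aims to surface.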



Published In

PPoPP '12: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2012, 352 pages
ISBN: 9781450311601
DOI: 10.1145/2145816

Also published in ACM SIGPLAN Notices, Volume 47, Issue 8 (PPOPP '12)
August 2012, 334 pages
ISSN: 0362-1340, EISSN: 1558-1160
DOI: 10.1145/2370036

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. CUDA
  2. GPGPU architecture
  3. analytical model
  4. performance benefit prediction
  5. performance prediction

Qualifiers

  • Research-article

Conference

PPoPP '12

Acceptance Rates

Overall acceptance rate: 230 of 1,014 submissions, 23%


Cited By
  • (2024) A Performance and Power Comparison of Contemporary GPGPU Architectures. 2024 3rd International Conference for Innovation in Technology (INOCON), pp. 1-5. DOI: 10.1109/INOCON60754.2024.10512242. Online publication date: 1-Mar-2024.
  • (2023) FARSI: An Early-stage Design Space Exploration Framework to Tame the Domain-specific System-on-chip Complexity. ACM Transactions on Embedded Computing Systems, 22(2), pp. 1-35. DOI: 10.1145/3544016. Online publication date: 24-Jan-2023.
  • (2022) Support structure tomography using per-pixel signed shadow casting in human manikin 3D printing. Fashion and Textiles, 9(1). DOI: 10.1186/s40691-022-00290-z. Online publication date: 5-Jul-2022.
  • (2022) GCoM. Proceedings of the 49th Annual International Symposium on Computer Architecture, pp. 424-436. DOI: 10.1145/3470496.3527384. Online publication date: 18-Jun-2022.
  • (2022) A Machine Learning Framework for Predicting the Glass Transition Temperature of Homopolymers. Industrial & Engineering Chemistry Research, 61(34), pp. 12690-12698. DOI: 10.1021/acs.iecr.2c01302. Online publication date: 9-Aug-2022.
  • (2021) Automated SmartNIC Offloading Insights for Network Functions. Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, pp. 772-787. DOI: 10.1145/3477132.3483583. Online publication date: 26-Oct-2021.
  • (2021) Demystifying TensorRT: Characterizing Neural Network Inference Engine on Nvidia Edge Devices. 2021 IEEE International Symposium on Workload Characterization (IISWC), pp. 226-237. DOI: 10.1109/IISWC53511.2021.00030. Online publication date: Nov-2021.
  • (2021) Experience and Practice Teaching an Undergraduate Course on Diverse Heterogeneous Architectures. 2021 IEEE/ACM Ninth Workshop on Education for High Performance Computing (EduHPC), pp. 1-8. DOI: 10.1109/EduHPC54835.2021.00006. Online publication date: Nov-2021.
  • (2021) HGP4CNN: an efficient parallelization framework for training convolutional neural networks on modern GPUs. The Journal of Supercomputing. DOI: 10.1007/s11227-021-03746-z. Online publication date: 12-Apr-2021.
  • (2020) Clara. Proceedings of the 19th ACM Workshop on Hot Topics in Networks, pp. 16-22. DOI: 10.1145/3422604.3425929. Online publication date: 4-Nov-2020.
