DOI: 10.1145/2145816.2145819
Research article

A performance analysis framework for identifying potential benefits in GPGPU applications

Published: 25 February 2012

Abstract

Tuning code for GPGPU and other emerging many-core platforms is a challenge because few models or tools can precisely pinpoint the root cause of performance bottlenecks. In this paper, we present a performance analysis framework that can help shed light on such bottlenecks for GPGPU applications. Although a handful of GPGPU profiling tools exist, most traditional tools unfortunately provide programmers only with a variety of measurements and metrics obtained by running applications, and it is often difficult to map these metrics to the root causes of slowdowns, let alone decide which optimization step to take next to alleviate a bottleneck. In our approach, we first develop an analytical performance model that can precisely predict performance and aims to provide programmer-interpretable metrics. Then, we apply static and dynamic profiling to instantiate our performance model for a particular input code and show how the model can predict the potential performance benefits. We demonstrate our framework on a suite of micro-benchmarks as well as a variety of computations extracted from real codes.




Reviews

Amitabha Roy

General-purpose graphics processing units (GPGPUs) are becoming increasingly popular as a means to accelerate various scientific kernels, as evidenced by their adoption in the high-performance computing community and by the integration of GPU cores into mainstream central processing units (CPUs). However, GPU performance tuning has thus far been a niche area due to the lack of tools for determining which factors contribute to the performance of individual program components. This is in contrast to the CPU domain, where many mature tools do a good job of performance analysis. This paper is an excellent first step in that direction.

The authors present a performance analysis framework that can attribute a GPU kernel's execution time to its different contributing factors. For example, it can separate the time spent waiting for memory accesses from the time spent computing results. Readers interested in performance modeling will find this approach instructive and novel. The paper starts with the construction of a detailed analytical performance model of the GPU. The authors then combine statically determined metrics (such as instruction group sizes within a basic block) with dynamically determined metrics (such as instruction mix) to parameterize the model. They can then predict the effect of various optimizations, some based on algorithm changes and some that can be applied automatically (such as using the available shared memory on the GPU). They also model parallelism carefully: for example, the authors separate parallelism during memory access from parallelism during computation, taking into account both the characteristics of the application and the parameters of the underlying GPU. This framework would be useful to anyone interested in optimizing the execution of their GPU kernels.

The paper is accessible to most people interested in performance modeling, although the terminology is naturally GPU-centric in places, and the evaluation (as presented) is limited to one specific GPU.

Online Computing Reviews Service
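The overlap of compute time and memory-wait time that the review describes can be illustrated with a toy analytical cost model. This is only an illustrative sketch: the function, its parameters, and the formulas below are hypothetical stand-ins, not the paper's actual equations.

```python
# Toy analytical GPU kernel cost model, in the spirit of the framework the
# review describes. All names and formulas here are hypothetical.

def predict_kernel_cycles(comp_insts, mem_insts, mem_latency,
                          active_warps, mem_bandwidth_warps):
    """Estimate per-warp kernel cycles, separating compute time from
    exposed (non-hidden) memory-wait time."""
    comp_cycles = comp_insts                 # assume 1 cycle per compute instruction
    mem_cycles = mem_insts * mem_latency     # total memory latency per warp
    # Memory-level parallelism: concurrent warps hide memory latency,
    # up to the number of requests the memory system can service at once.
    mwp = min(active_warps, mem_bandwidth_warps)
    exposed_mem = mem_cycles / mwp
    # Compute overlaps with memory waiting; the total is dominated by the
    # larger of the two, plus a small non-overlapped remainder.
    return max(comp_cycles, exposed_mem) + min(comp_cycles, exposed_mem) / active_warps

# The framework's "potential benefit" idea: model an optimization (e.g. moving
# reused data into shared memory) as fewer global-memory instructions, then
# re-evaluate the model instead of rewriting and re-running the kernel.
baseline  = predict_kernel_cycles(comp_insts=200, mem_insts=40,
                                  mem_latency=400, active_warps=8,
                                  mem_bandwidth_warps=4)
optimized = predict_kernel_cycles(comp_insts=200, mem_insts=10,
                                  mem_latency=400, active_warps=8,
                                  mem_bandwidth_warps=4)
predicted_speedup = baseline / optimized
```

In this toy setting the baseline kernel is memory-bound (exposed memory time far exceeds compute time), so the model predicts a large benefit from reducing global-memory traffic; a compute-bound kernel would show almost none, which is exactly the kind of root-cause distinction the paper's framework aims to surface.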



Published In

PPoPP '12: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2012, 352 pages
ISBN: 9781450311601
DOI: 10.1145/2145816

Also published in ACM SIGPLAN Notices, Volume 47, Issue 8 (PPOPP '12)
August 2012, 334 pages
ISSN: 0362-1340, EISSN: 1558-1160
DOI: 10.1145/2370036

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. CUDA
  2. GPGPU architecture
  3. analytical model
  4. performance benefit prediction
  5. performance prediction

Qualifiers

  • Research-article

Conference

PPoPP '12

Acceptance Rates

Overall acceptance rate: 230 of 1,014 submissions, 23%


Cited By
  • (2024) A Performance and Power Comparison of Contemporary GPGPU Architectures. 2024 3rd International Conference for Innovation in Technology (INOCON), pp. 1-5. DOI: 10.1109/INOCON60754.2024.10512242. Online publication date: 1-Mar-2024.
  • (2023) FARSI: An Early-stage Design Space Exploration Framework to Tame the Domain-specific System-on-chip Complexity. ACM Transactions on Embedded Computing Systems, 22(2), pp. 1-35. DOI: 10.1145/3544016. Online publication date: 24-Jan-2023.
  • (2022) Support structure tomography using per-pixel signed shadow casting in human manikin 3D printing. Fashion and Textiles, 9(1). DOI: 10.1186/s40691-022-00290-z. Online publication date: 5-Jul-2022.
  • (2022) GCoM. Proceedings of the 49th Annual International Symposium on Computer Architecture, pp. 424-436. DOI: 10.1145/3470496.3527384. Online publication date: 18-Jun-2022.
  • (2022) A Machine Learning Framework for Predicting the Glass Transition Temperature of Homopolymers. Industrial & Engineering Chemistry Research, 61(34), pp. 12690-12698. DOI: 10.1021/acs.iecr.2c01302. Online publication date: 9-Aug-2022.
  • (2021) Automated SmartNIC Offloading Insights for Network Functions. Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, pp. 772-787. DOI: 10.1145/3477132.3483583. Online publication date: 26-Oct-2021.
  • (2021) Demystifying TensorRT: Characterizing Neural Network Inference Engine on Nvidia Edge Devices. 2021 IEEE International Symposium on Workload Characterization (IISWC), pp. 226-237. DOI: 10.1109/IISWC53511.2021.00030. Online publication date: Nov-2021.
  • (2021) Experience and Practice Teaching an Undergraduate Course on Diverse Heterogeneous Architectures. 2021 IEEE/ACM Ninth Workshop on Education for High Performance Computing (EduHPC), pp. 1-8. DOI: 10.1109/EduHPC54835.2021.00006. Online publication date: Nov-2021.
  • (2021) HGP4CNN: an efficient parallelization framework for training convolutional neural networks on modern GPUs. The Journal of Supercomputing. DOI: 10.1007/s11227-021-03746-z. Online publication date: 12-Apr-2021.
  • (2020) Clara. Proceedings of the 19th ACM Workshop on Hot Topics in Networks, pp. 16-22. DOI: 10.1145/3422604.3425929. Online publication date: 4-Nov-2020.
