DOI: 10.1145/1693453.1693470

An adaptive performance modeling tool for GPU architectures

Published: 09 January 2010

Abstract

This paper presents an analytical model to predict the performance of general-purpose applications on a GPU architecture. The model is designed to provide performance information to an auto-tuning compiler and to assist it in narrowing down the search to the more promising implementations. It can also be incorporated into a tool to help programmers better assess the performance bottlenecks in their code. We analyze each GPU kernel and identify how the kernel exercises major GPU microarchitecture features. To identify the performance bottlenecks accurately, we introduce an abstract interpretation of a GPU kernel, the work flow graph, from which we estimate the kernel's execution time. We validated our performance model on NVIDIA GPUs using CUDA (Compute Unified Device Architecture). For this purpose, we used data-parallel benchmarks that stress different GPU microarchitecture events such as uncoalesced memory accesses, scratch-pad memory bank conflicts, and control flow divergence, which must be modeled accurately but pose challenges to analytical performance models. The proposed model captures full system complexity and shows high accuracy in predicting the performance trends of different optimized kernel implementations. We also describe our approach to extracting the performance model automatically from the kernel code.
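
For context on the three microarchitecture events the abstract highlights, the following minimal CUDA sketch (not taken from the paper; the kernel name stress_events and its parameters are illustrative) deliberately exhibits uncoalesced global memory accesses, scratch-pad (shared-memory) bank conflicts, and control flow divergence, i.e., the kinds of costs an analytical model of this sort must attribute correctly.

```cuda
// Illustrative kernel, not from the paper: each section triggers one of
// the events the abstract names, so a performance model must charge a
// cost to each.
__global__ void stress_events(const float *in, float *out, int n, int stride)
{
    __shared__ float tile[256];          // assumes blockDim.x == 256
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Uncoalesced global load: consecutive threads of a warp read
    // addresses 'stride' elements apart, so one warp request is split
    // into several memory transactions when stride > 1.
    float v = (tid * stride < n) ? in[tid * stride] : 0.0f;

    tile[threadIdx.x] = v;               // conflict-free store
    __syncthreads();

    // Shared-memory bank conflict: the stride-2 read pattern maps two
    // threads of each warp to the same bank, serializing their accesses.
    float s = tile[(threadIdx.x * 2) % 256];

    // Control flow divergence: threads of the same warp take different
    // branches, so the two paths execute one after the other.
    if (threadIdx.x % 2 == 0)
        v = v * 2.0f + s;
    else
        v = v + s;

    if (tid < n)
        out[tid] = v;
}
```

A model along the lines described in the abstract would presumably map the work flow graph of such a kernel to transaction counts for the strided load, a serialization factor for the bank-conflicted read, and serialized path lengths for the divergent branch.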

Published In

PPoPP '10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
January 2010
372 pages
ISBN: 9781605588773
DOI: 10.1145/1693453

Also published in:
ACM SIGPLAN Notices, Volume 45, Issue 5 (PPoPP '10)
May 2010
346 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1837853

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 January 2010

Author Tags

  1. analytical model
  2. gpu
  3. parallel programming
  4. performance estimation

Qualifiers

  • Research-article

Conference

PPoPP '10

Acceptance Rates

Overall Acceptance Rate: 230 of 1,014 submissions, 23%

