DOI: 10.1145/1693453.1693470

An adaptive performance modeling tool for GPU architectures

Published: 09 January 2010

Abstract

This paper presents an analytical model to predict the performance of general-purpose applications on a GPU architecture. The model is designed to provide performance information to an auto-tuning compiler and to assist it in narrowing down the search to the more promising implementations. It can also be incorporated into a tool to help programmers better assess the performance bottlenecks in their code. We analyze each GPU kernel and identify how the kernel exercises major GPU microarchitecture features. To identify the performance bottlenecks accurately, we introduce an abstract interpretation of a GPU kernel, the work flow graph, from which we estimate the kernel's execution time. We validated our performance model on NVIDIA GPUs using CUDA (Compute Unified Device Architecture). For this purpose, we used data-parallel benchmarks that stress different GPU microarchitecture events such as uncoalesced memory accesses, scratch-pad memory bank conflicts, and control flow divergence, which must be modeled accurately but pose challenges to analytical performance models. The proposed model captures full system complexity and shows high accuracy in predicting the performance trends of different optimized kernel implementations. We also describe our approach to extracting the performance model automatically from the kernel code.
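
For context on the three microarchitecture events the abstract highlights, the following minimal CUDA sketch (not taken from the paper; the kernel name stress_events and its parameters are illustrative) deliberately exhibits uncoalesced global memory accesses, scratch-pad (shared-memory) bank conflicts, and control flow divergence, i.e., the kinds of costs an analytical model of this sort must attribute correctly.

```cuda
// Illustrative kernel, not from the paper: each section triggers one of
// the events the abstract names, so a performance model must charge a
// cost to each.
__global__ void stress_events(const float *in, float *out, int n, int stride)
{
    __shared__ float tile[256];          // assumes blockDim.x == 256
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Uncoalesced global load: consecutive threads of a warp read
    // addresses 'stride' elements apart, so one warp request is split
    // into several memory transactions when stride > 1.
    float v = (tid * stride < n) ? in[tid * stride] : 0.0f;

    tile[threadIdx.x] = v;               // conflict-free store
    __syncthreads();

    // Shared-memory bank conflict: the stride-2 read pattern maps two
    // threads of each warp to the same bank, serializing their accesses.
    float s = tile[(threadIdx.x * 2) % 256];

    // Control flow divergence: threads of the same warp take different
    // branches, so the two paths execute one after the other.
    if (threadIdx.x % 2 == 0)
        v = v * 2.0f + s;
    else
        v = v + s;

    if (tid < n)
        out[tid] = v;
}
```

A model along the lines described in the abstract would presumably map the work flow graph of such a kernel to transaction counts for the strided load, a serialization factor for the bank-conflicted read, and serialized path lengths for the divergent branch.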

Published In

PPoPP '10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
January 2010
372 pages
ISBN: 9781605588773
DOI: 10.1145/1693453

Also published in:
ACM SIGPLAN Notices, Volume 45, Issue 5 (PPoPP '10)
May 2010
346 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1837853

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 January 2010

Author Tags

  1. analytical model
  2. gpu
  3. parallel programming
  4. performance estimation

Qualifiers

  • Research-article

Conference

PPoPP '10

Acceptance Rates

Overall Acceptance Rate: 230 of 1,014 submissions, 23%

