DOI: 10.1145/1345206.1345220

Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

Published: 20 February 2008

Abstract

GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of traditional GPUs. This opportunity should redirect efforts in GPGPU research from ad hoc porting of applications to establishing principles and strategies that allow efficient mapping of computation to graphics hardware. In this work we discuss the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies. Key to performance on this platform is using massive multithreading to utilize the large number of cores and hide global memory latency. To achieve this, developers face the challenge of striking the right balance between each thread's resource usage and the number of simultaneously active threads. The resources to manage include the number of registers and the amount of on-chip memory used per thread, the number of threads per multiprocessor, and global memory bandwidth. We also obtain increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and by applying classical optimizations to reduce the number of executed operations. We apply these strategies across a variety of applications and domains and achieve speedups between 10.5X and 457X in kernel codes and between 1.16X and 431X in total application performance.
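The balance the abstract describes, between each thread's resource usage and the number of simultaneously active threads, can be sketched numerically. The helper below estimates how many threads can be resident on one GeForce 8800 GTX streaming multiprocessor given per-thread register use and per-block shared memory use. The hardware limits (8,192 registers, 16 KB of shared memory, 768 thread contexts, and 8 resident blocks per multiprocessor) are the G80 values; the function itself is an illustrative simplification, not the paper's model, and it ignores hardware allocation granularity.

```python
# Illustrative occupancy estimate for one G80 streaming multiprocessor (SM).
# Limits below are GeForce 8800 GTX values; the model is a simplification
# (real hardware rounds register and shared-memory allocations up).

SM_REGISTERS = 8192      # 32-bit registers per SM
SM_SHARED_BYTES = 16384  # bytes of shared (on-chip) memory per SM
SM_MAX_THREADS = 768     # hardware thread contexts per SM
SM_MAX_BLOCKS = 8        # resident thread blocks per SM

def active_threads(threads_per_block, regs_per_thread, shared_per_block):
    """Estimate simultaneously active threads on one SM."""
    by_regs = SM_REGISTERS // (regs_per_thread * threads_per_block)
    by_shared = (SM_SHARED_BYTES // shared_per_block
                 if shared_per_block else SM_MAX_BLOCKS)
    by_contexts = SM_MAX_THREADS // threads_per_block
    blocks = min(by_regs, by_shared, by_contexts, SM_MAX_BLOCKS)
    return blocks * threads_per_block

# A lean kernel (10 registers/thread) keeps all 768 thread contexts busy:
print(active_threads(256, 10, 4096))   # -> 768
# Doubling register pressure cuts residency to one block, hurting the
# latency hiding that the paper identifies as key to performance:
print(active_threads(256, 20, 4096))   # -> 256
```

This is the tradeoff the paper's optimization strategies navigate: reducing per-thread resource usage (e.g. register count) raises the number of threads available to hide global memory latency, but can increase the number of executed operations per thread.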



Published In

PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2008, 308 pages
ISBN: 9781595937957
DOI: 10.1145/1345206

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. GPU computing
  2. parallel computing

Qualifiers

  • Research-article

Conference

PPoPP '08

Acceptance Rates

Overall Acceptance Rate: 230 of 1,014 submissions, 23%

