DOI: 10.1145/1345206.1345220

Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

Published: 20 February 2008

Abstract

GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of traditional GPUs. This opportunity should redirect efforts in GPGPU research from ad hoc porting of applications to establishing principles and strategies that allow efficient mapping of computation to graphics hardware. In this work we discuss the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies. Key to performance on this platform is using massive multithreading to utilize the large number of cores and hide global memory latency. To achieve this, developers face the challenge of striking the right balance between each thread's resource usage and the number of simultaneously active threads. The resources to manage include the number of registers and the amount of on-chip memory used per thread, the number of threads per multiprocessor, and global memory bandwidth. We also obtain increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and by applying classical optimizations to reduce the number of executed operations. We apply these strategies across a variety of applications and domains and achieve speedups between 10.5X and 457X in kernel codes and between 1.16X and 431X in total application performance.
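The balance the abstract describes, between each thread's resource usage and the number of simultaneously active threads, can be sketched numerically. The helper below estimates how many threads can be resident on one GeForce 8800 GTX streaming multiprocessor given per-thread register use and per-block shared memory use. The hardware limits (8,192 registers, 16 KB of shared memory, 768 thread contexts, and 8 resident blocks per multiprocessor) are the G80 values; the function itself is an illustrative simplification, not the paper's model, and it ignores hardware allocation granularity.

```python
# Illustrative occupancy estimate for one G80 streaming multiprocessor (SM).
# Limits below are GeForce 8800 GTX values; the model is a simplification
# (real hardware rounds register and shared-memory allocations up).

SM_REGISTERS = 8192      # 32-bit registers per SM
SM_SHARED_BYTES = 16384  # bytes of shared (on-chip) memory per SM
SM_MAX_THREADS = 768     # hardware thread contexts per SM
SM_MAX_BLOCKS = 8        # resident thread blocks per SM

def active_threads(threads_per_block, regs_per_thread, shared_per_block):
    """Estimate simultaneously active threads on one SM."""
    by_regs = SM_REGISTERS // (regs_per_thread * threads_per_block)
    by_shared = (SM_SHARED_BYTES // shared_per_block
                 if shared_per_block else SM_MAX_BLOCKS)
    by_contexts = SM_MAX_THREADS // threads_per_block
    blocks = min(by_regs, by_shared, by_contexts, SM_MAX_BLOCKS)
    return blocks * threads_per_block

# A lean kernel (10 registers/thread) keeps all 768 thread contexts busy:
print(active_threads(256, 10, 4096))   # -> 768
# Doubling register pressure cuts residency to one block, hurting the
# latency hiding that the paper identifies as key to performance:
print(active_threads(256, 20, 4096))   # -> 256
```

This is the tradeoff the paper's optimization strategies navigate: reducing per-thread resource usage (e.g. register count) raises the number of threads available to hide global memory latency, but can increase the number of executed operations per thread.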



Published In

PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2008, 308 pages
ISBN: 9781595937957
DOI: 10.1145/1345206

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. GPU computing
  2. parallel computing

Qualifiers

  • Research-article

Conference

PPoPP '08

Acceptance Rates

Overall Acceptance Rate: 230 of 1,014 submissions, 23%

