Article

Architecture-Aware Mapping and Optimization on a 1600-Core GPU

Authors:

Mayank Daga,

Thomas Scogland,

Wu-chun Feng

ICPADS '11: Proceedings of the 2011 IEEE 17th International Conference on Parallel and Distributed Systems

Pages 316 - 323

https://doi.org/10.1109/ICPADS.2011.29

Published: 07 December 2011

Abstract

The graphics processing unit (GPU) continues to make inroads as a computational accelerator for high-performance computing (HPC). However, despite its increasing popularity, mapping and optimizing GPU code remains a difficult task: it is a multi-dimensional problem that requires deep technical knowledge of GPU architecture. Although substantial literature exists on how to map and optimize GPU performance on the more mature NVIDIA CUDA architecture, the converse is true for OpenCL on an AMD GPU, such as the 1600-core AMD Radeon HD 5870 GPU. Consequently, we present and evaluate architecture-aware mapping and optimizations for the AMD GPU, the most prominent of which include (i) explicit use of registers, (ii) use of vector types, (iii) removal of branches, and (iv) use of image memory for global data. We demonstrate the efficacy of our AMD GPU mapping and optimizations by applying each in isolation as well as in concert to a large-scale molecular modeling application called GEM. Via these AMD-specific GPU optimizations, our optimized OpenCL implementation on an AMD Radeon HD 5870 delivers more than a four-fold improvement in performance over the basic OpenCL implementation. In addition, it outperforms our optimized CUDA version on an NVIDIA GTX280 by 12%. Overall, we achieve a 371-fold speedup over a serial but hand-tuned SSE version of our molecular modeling application and, in turn, a 46-fold speedup over ideal scaling on an 8-core CPU.
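
The final figure in the abstract appears to follow from dividing the serial speedup by the core count. The sketch below assumes that "ideal scaling" means a perfectly linear 8x speedup of the serial SSE version on the 8-core CPU. The function name is illustrative and does not come from the paper.

```python
def speedup_over_ideal_multicore(serial_speedup: float, cores: int) -> float:
    """Speedup relative to a CPU that scales perfectly linearly across `cores`.

    Assumption (not stated in the abstract): ideal scaling means the
    multicore CPU runs exactly `cores` times faster than the serial version.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    return serial_speedup / cores


gpu_vs_serial = 371.0  # speedup over the serial SSE version, from the abstract
cpu_cores = 8          # core count of the CPU, from the abstract

ratio = speedup_over_ideal_multicore(gpu_vs_serial, cpu_cores)
print(f"{ratio:.1f}x over ideal {cpu_cores}-core scaling")  # prints "46.4x over ideal 8-core scaling"
```

Under that assumption, 371 / 8 = 46.375, which is consistent with the 46-fold figure quoted above.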

Cited By

Hijma P, Heldens S, Sclocco A, van Werkhoven B, Bal H (2023). Optimization Techniques for GPU Programming. ACM Computing Surveys 55:11 (1-81). doi:10.1145/3570638. Online publication date: 16-Mar-2023
https://dl.acm.org/doi/10.1145/3570638
Sathre P, Gardner M, Feng W (2019). On the Portability of CPU-Accelerated Applications via Automated Source-to-Source Translation. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (1-8). doi:10.1145/3293320.3293338. Online publication date: 14-Jan-2019
https://dl.acm.org/doi/10.1145/3293320.3293338
Tristram D, Hughes D, Bradshaw K (2014). Accelerating a hydrological uncertainty ensemble model using graphics processing units (GPUs). Computers & Geosciences 62:C (178-186). doi:10.5555/2745549.2745661. Online publication date: 1-Jan-2014
https://dl.acm.org/doi/10.5555/2745549.2745661

Recommendations

On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing
SAAHPC '11: Proceedings of the 2011 Symposium on Application Accelerators in High-Performance Computing

The graphics processing unit (GPU) has made significant strides as an accelerator in parallel computing. However, because the GPU has resided out on PCIe as a discrete device, the performance of GPU applications can be bottlenecked by data transfers ...
A Performance Model for GPUs with Caches
To exploit the abundant computational power of the world's fastest supercomputers, an even workload distribution to the typically heterogeneous compute devices is necessary. While relatively accurate performance models exist for conventional CPUs, ...
Evaluation of GPU Architectures Using Spiking Neural Networks
SAAHPC '11: Proceedings of the 2011 Symposium on Application Accelerators in High-Performance Computing

During recent years General-Purpose Graphical Processing Units (GP-GPUs) have entered the field of High-Performance Computing (HPC) as one of the primary architectural focuses for many research groups working with complex scientific applications. Nvidia'...

Published In

ICPADS '11: Proceedings of the 2011 IEEE 17th International Conference on Parallel and Distributed Systems

December 2011

1069 pages

ISBN:9780769545769

Publisher

IEEE Computer Society

United States

Citations

Scogland T, Feng W, Rountree B, Supinski B (2014). CoreTSAR. Proceedings of the 29th International Conference on Supercomputing - Volume 8488 (172-186). doi:10.1007/978-3-319-07518-1_11. Online publication date: 22-Jun-2014
https://dl.acm.org/doi/10.1007/978-3-319-07518-1_11
Tristram D, Bradshaw K, McNeill J, Bradshaw K (2013). Evaluating the acceleration of typical scientific problems on the GPU. Proceedings of the South African Institute for Computer Scientists and Information Technologists Conference (17-26). doi:10.1145/2513456.2513473. Online publication date: 7-Oct-2013
https://dl.acm.org/doi/10.1145/2513456.2513473
Gardner M, Sathre P, Feng W, Martinez G (2013). Characterizing the challenges and evaluating the efficacy of a CUDA-to-OpenCL translator. Parallel Computing 39:12 (769-786). doi:10.1016/j.parco.2013.09.003. Online publication date: 1-Dec-2013
https://dl.acm.org/doi/10.1016/j.parco.2013.09.003
Albayrak O, Akturk I, Ozturk O (2013). Improving application behavior on heterogeneous manycore systems through kernel mapping. Parallel Computing 39:12 (867-878). doi:10.1016/j.parco.2013.08.011. Online publication date: 1-Dec-2013
https://dl.acm.org/doi/10.1016/j.parco.2013.08.011
