Article

Modeling Cooperative Threads to Project GPU Performance for Adaptive Parallelism

Authors:

Jiayuan Meng,

Thomas D. Uram,

Vitali Morozov,

Venkatram Vishwanath,

Kalyan Kumaran

IPDPSW '15: Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop

Pages 998 - 1007

https://doi.org/10.1109/IPDPSW.2015.55

Published: 25 May 2015

Abstract

Most accelerators, such as graphics processing units (GPUs) and vector processors, are particularly suitable for accelerating massively parallel workloads. Conventional workloads, on the other hand, are developed for multi-core parallelism and often scale to only a few dozen OpenMP threads. When hardware threads significantly outnumber the degree of parallelism in the outer loop, programmers are challenged to utilize the hardware efficiently. A common solution is to further exploit the parallelism hidden deep in the code structure. Such parallelism is less structured: parallel and sequential loops may be imperfectly nested within each other, and neighboring inner loops may exhibit different concurrency patterns (e.g., Reduction vs. Forall) yet have to be parallelized in the same parallel section. Many input-dependent transformations have to be explored. A programmer often employs a larger group of hardware threads to cooperatively walk through a smaller outer loop partition and adaptively exploit any encountered parallelism. This process is time-consuming and error-prone, yet the risk of gaining little or no performance remains high for such workloads. To reduce risk and guide implementation, we propose a technique to model workloads with limited parallelism that can automatically explore and evaluate transformations involving cooperative threads. Ultimately, our framework projects the best achievable performance and the most promising transformations without implementing GPU code or using physical hardware. We envision our technique being integrated into future compilers or optimization frameworks for autotuning.
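
The cooperative-thread pattern described in the abstract can be illustrated with a minimal sketch. This is not the authors' framework or model; it is a sequential Python emulation, assuming a hypothetical group of `GROUP_SIZE` threads that jointly processes each outer-loop iteration, first with a Forall-style inner loop and then with a neighboring Reduction-style inner loop. All function names and the group size are illustrative assumptions.

```python
# Sequential emulation of "cooperative threads": a group of GROUP_SIZE
# (hypothetical) threads walks one outer-loop iteration together.

GROUP_SIZE = 4  # assumed number of cooperating threads per outer iteration


def forall_scale(row, factor, group_size=GROUP_SIZE):
    """Forall pattern: thread t handles elements t, t + group_size, ..."""
    out = [0.0] * len(row)
    for t in range(group_size):            # each t stands in for one thread
        for j in range(t, len(row), group_size):
            out[j] = row[j] * factor
    return out


def reduce_sum(row, group_size=GROUP_SIZE):
    """Reduction pattern: per-thread partial sums, then a tree combine."""
    partial = [0.0] * group_size
    for t in range(group_size):
        for j in range(t, len(row), group_size):
            partial[t] += row[j]
    stride = 1
    while stride < group_size:             # about log2(group_size) steps
        for t in range(0, group_size - stride, 2 * stride):
            partial[t] += partial[t + stride]
        stride *= 2
    return partial[0]


def process(matrix, factor):
    """Outer loop: few iterations, each owned by one cooperative group."""
    results = []
    for row in matrix:                      # small outer-loop partition
        scaled = forall_scale(row, factor)  # inner Forall loop
        results.append(reduce_sum(scaled))  # neighboring inner Reduction
    return results
```

In this sketch the rows have different lengths, so the amount of inner parallelism varies per outer iteration; on real hardware that input dependence is one reason the best transformation has to be explored rather than fixed in advance.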

Modeling Cooperative Threads to Project GPU Performance for Adaptive Parallelism
1. Theory of computation
  1. Models of computation
    1. Concurrency


Published In

IPDPSW '15: Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop

May 2015

1256 pages

ISBN:9781467376846

Publisher

IEEE Computer Society

United States
