DOI: 10.1145/2830772.2830813
Research article

Enabling coordinated register allocation and thread-level parallelism optimization for GPUs

Published: 05 December 2015

Abstract

The key to high performance on GPUs lies in massive multithreading, which enables fast thread switching to hide function unit and memory access latency. However, running with the maximum thread-level parallelism (TLP) does not necessarily yield the best performance, because excessive threads contend for cache resources. As a result, thread throttling techniques limit the number of concurrently executing threads to preserve data locality. On the other hand, GPUs are equipped with a large register file to enable fast context switching between threads; thread throttling techniques designed to mitigate cache contention therefore leave registers underutilized. Register allocation is a significant factor for performance, as it not only determines single-thread performance but also indirectly affects the TLP.
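The register/TLP tension described above can be made concrete with a back-of-the-envelope occupancy calculation. The sketch below (not from the paper; the register-file size and warp limit are assumed Fermi-like figures) shows how raising the per-thread register budget shrinks the number of warps that can be resident on a streaming multiprocessor:

```python
# Illustrative sketch: how the per-thread register budget bounds TLP on a
# hypothetical SM with a 32768-entry register file and a 48-resident-warp
# hardware cap (assumed, Fermi-like figures; not taken from the paper).

REG_FILE_SIZE = 32768   # 32-bit registers per SM (assumption)
MAX_WARPS = 48          # hardware limit on resident warps per SM (assumption)
WARP_SIZE = 32          # threads per warp

def resident_warps(regs_per_thread: int) -> int:
    """Warps that fit under the register budget, capped by the hardware limit."""
    by_regs = REG_FILE_SIZE // (regs_per_thread * WARP_SIZE)
    return min(by_regs, MAX_WARPS)

for r in (16, 21, 32, 63):
    print(f"{r:2d} regs/thread -> {resident_warps(r)} resident warps")
```

At 16 or 21 registers per thread the warp cap, not the register file, is the binding constraint; beyond that, each extra register per thread directly costs TLP, which is exactly the tradeoff a coordinated framework must navigate.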
The joint design space of register allocation and TLP presents new opportunities for performance optimization. However, the complicated correlation between the two factors inevitably leads to many performance dynamics and uncertainties. In this paper, we propose Coordinated Register Allocation and Thread-level parallelism (CRAT), a compiler-based performance optimization framework. CRAT first enables effective register allocation: given a per-thread register limit, it allocates registers by analyzing the lifetimes of variables and, to reduce the spilling cost, spills registers to shared memory when possible. CRAT then explores the design space, first pruning design points that cause serious L1 cache thrashing or register underutilization, and then employing a prediction model to find the best tradeoff between single-thread performance and TLP. We evaluate CRAT using a set of representative GPU workloads. Experimental results indicate that, compared to the optimal thread throttling technique, our framework improves performance by up to 1.79X (geometric mean 1.25X).
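The first step, lifetime-based allocation under a per-thread register cap, can be sketched with a classic linear-scan pass over variable live intervals. This is a hypothetical illustration in the spirit of that step, not the paper's implementation; the interval representation and the spill list (standing in for shared memory) are assumptions:

```python
# Hypothetical sketch of lifetime-based register allocation: linear scan
# over variable live intervals with a per-thread register cap. Variables
# that do not fit are spilled (in CRAT, preferably to shared memory).
# All names and data structures here are illustrative assumptions.

def allocate(intervals, reg_limit):
    """intervals: list of (name, start, end); returns (assignment, spills)."""
    intervals = sorted(intervals, key=lambda iv: iv[1])   # by start point
    active = []                       # (end, reg) pairs currently live
    free = list(range(reg_limit))     # free physical register numbers
    assignment, spills = {}, []
    for name, start, end in intervals:
        # Expire intervals whose lifetime ended before this one starts,
        # returning their registers to the free pool.
        still_live = []
        for e, reg in active:
            if e <= start:
                free.append(reg)
            else:
                still_live.append((e, reg))
        active = still_live
        if free:
            reg = free.pop()
            assignment[name] = reg
            active.append((end, reg))
        else:
            spills.append(name)       # no register left: spill this variable
    return assignment, spills
```

For example, with intervals a:[0,4], b:[1,3], c:[2,6], d:[5,8] and a cap of 2 registers, c is spilled (a and b are both live when it starts) while d reuses a register freed when a and b expire; lowering the cap trades more such spills for more resident threads.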



Published In

cover image ACM Conferences
MICRO-48: Proceedings of the 48th International Symposium on Microarchitecture
December 2015
787 pages
ISBN:9781450340342
DOI:10.1145/2830772
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Acceptance Rates

MICRO-48 paper acceptance rate: 61 of 283 submissions, 22%. Overall acceptance rate: 484 of 2,242 submissions, 22%.


Cited By

  • (2024) PresCount: Effective Register Allocation for Bank Conflict Reduction. Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization, pp. 170-181. DOI: 10.1109/CGO57630.2024.10444841
  • (2023) Modeling and Characterizing Shared and Local Memories of the Ampere GPUs. Proceedings of the International Symposium on Memory Systems, pp. 1-3. DOI: 10.1145/3631882.3631891
  • (2023) A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX Code. Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction, pp. 110-121. DOI: 10.1145/3578360.3580253
  • (2023) Re-Cache: Mitigating cache contention by exploiting locality characteristics with reconfigurable memory hierarchy for GPGPUs. Microelectronics Journal, vol. 138, 105825. DOI: 10.1016/j.mejo.2023.105825
  • (2022) OSM: Off-Chip Shared Memory for GPUs. IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 12, pp. 3415-3429. DOI: 10.1109/TPDS.2022.3154315
  • (2022) A Fine-grained Prefetching Scheme for DGEMM Kernels on GPU with Auto-tuning Compatibility. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 863-874. DOI: 10.1109/IPDPS53621.2022.00089
  • (2022) TEA-RC: Thread Context-Aware Register Cache for GPUs. IEEE Access, vol. 10, pp. 82049-82062. DOI: 10.1109/ACCESS.2022.3196149
  • (2021) MIPSGPU: Minimizing Pipeline Stalls for GPUs With Non-Blocking Execution. IEEE Transactions on Computers, vol. 70, no. 11, pp. 1804-1816. DOI: 10.1109/TC.2020.3026043
  • (2021) Virtual-Cache. Microprocessors & Microsystems, vol. 85. DOI: 10.1016/j.micpro.2021.104301
  • (2020) Efficient Nearest-Neighbor Data Sharing in GPUs. ACM Transactions on Architecture and Code Optimization, vol. 18, no. 1, pp. 1-26. DOI: 10.1145/3429981
