
On-GPU thread-data remapping for nested branch divergence

Published: 01 May 2020

Abstract

Nested branches are common in applications built around decision trees. The deeper the branch nest, the larger the slowdown caused by nested branch divergence on GPUs. Since inner branches are impractical to evaluate on the host side, thread-data remapping via GPU shared memory is so far the most suitable solution. However, existing solutions cannot handle inner branches directly, because the behavior of the GPU barrier function is undefined when it is executed inside branch statements; race conditions must therefore be prevented without the barrier function. Targeting nested divergence, we propose NeX, a nested extension scheme featuring an inter-thread protocol that supports sub-workgroup synchronization. We further exploit the on-the-fly nature of the Head-or-Tail (HoT) algorithm and propose HoT2, which adds flexibility to wavefront scheduling. Evaluated on four GPU models, including NVIDIA Volta and Turing, HoT2 proves more efficient. For benchmarks with branch nests up to five layers deep, NeX further boosts performance by up to 1.56x.
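To make the core idea concrete, the following is a minimal host-side simulation of thread-data remapping for a single branch. It is an illustrative sketch, not the paper's algorithm: the wavefront width `W`, the `remap` helper, and the toy data are all assumptions made for the example. The point is that reordering the thread-to-data mapping so that items with the same branch outcome land in the same wavefront eliminates divergent wavefronts.

```python
# Host-side simulation of thread-data remapping (TDR) for one branch.
# Data items are regrouped so that each wavefront of W threads receives
# items with a uniform branch outcome. All names and sizes here are
# illustrative, not taken from the paper.

W = 4  # toy wavefront (warp) width; real GPUs use 32 or 64

def divergent_wavefronts(outcomes, w=W):
    """Count wavefronts whose threads disagree on the branch outcome."""
    return sum(
        len(set(outcomes[i:i + w])) > 1
        for i in range(0, len(outcomes), w)
    )

def remap(data, predicate):
    """Reorder data so taken and not-taken items are contiguous."""
    taken = [x for x in data if predicate(x)]
    not_taken = [x for x in data if not predicate(x)]
    return taken + not_taken

data = [7, 2, 9, 4, 1, 8, 3, 6]      # alternating odd/even items
pred = lambda x: x % 2 == 1          # the branch condition

before = [pred(x) for x in data]
after = [pred(x) for x in remap(data, pred)]

print(divergent_wavefronts(before))  # 2: every wavefront diverges
print(divergent_wavefronts(after))   # 0: none diverge after remapping
```

On a real GPU the remapping is done in shared memory within a workgroup; nesting the scheme inside inner branches is exactly where the barrier-function restriction described above bites, motivating the barrier-free synchronization protocol.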

Highlights

Inter-thread synchronization can be efficient without the GPU barrier function.
Software-managed synchronization allows flexible wavefront scheduling.
With recursion, on-GPU thread-data remapping reduces divergence more completely.
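The first highlight, synchronizing a subset of threads without a hardware barrier, can be sketched with a counter-based rendezvous. This is a hypothetical CPU analogue using Python threads, not the paper's inter-thread protocol: a real GPU version would use atomics on shared memory plus memory fences, and the `SubgroupSync` class and its fields are invented for this sketch.

```python
import threading

class SubgroupSync:
    """Spin-on-counter rendezvous for a fixed-size subset of threads.
    Loosely analogous to sub-workgroup synchronization without a
    hardware barrier; illustrative only."""

    def __init__(self, size):
        self.size = size
        self.arrived = 0
        self.generation = 0
        self.lock = threading.Lock()

    def wait(self):
        with self.lock:
            gen = self.generation
            self.arrived += 1
            if self.arrived == self.size:  # last arrival releases the rest
                self.arrived = 0
                self.generation += 1
        while True:                        # spin until the generation flips
            with self.lock:
                if self.generation != gen:
                    return

log, log_lock = [], threading.Lock()

def worker(sync, name):
    with log_lock:
        log.append((name, "before"))
    sync.wait()                            # all subgroup members rendezvous here
    with log_lock:
        log.append((name, "after"))

sync = SubgroupSync(4)
threads = [threading.Thread(target=worker, args=(sync, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every "before" entry precedes every "after" entry.
print(all(phase == "before" for _, phase in log[:4]))  # True
```

Because only the participating subset spins on the counter, such a scheme can be applied inside a branch that only some threads of the workgroup enter, which is precisely where a workgroup-wide barrier would be undefined.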


Cited By

  • (2023) Divergence Reduction in Monte Carlo Neutron Transport with On-GPU Asynchronous Scheduling, ACM Transactions on Modeling and Computer Simulation, 34(1), 1–25. Online publication date: 19-Oct-2023. DOI: 10.1145/3626957
  • (2023) Optimization Techniques for GPU Programming, ACM Computing Surveys, 55(11), 1–81. Online publication date: 16-Mar-2023. DOI: 10.1145/3570638

        Published In

        Journal of Parallel and Distributed Computing  Volume 139, Issue C
        May 2020
        162 pages

        Publisher

        Academic Press, Inc.

        United States


        Author Tags

        1. GPGPU
        2. Branch divergence
        3. SIMD
        4. Race condition

        Qualifiers

        • Research-article
