
On-GPU thread-data remapping for nested branch divergence

Published: 01 May 2020

Abstract

Nested branches are common in applications built around decision trees. The deeper the branch nest, the larger the slowdown caused by nested branch divergence on GPUs. Since inner branches are impractical to evaluate on the host side, thread-data remapping via GPU shared memory is so far the most suitable solution. However, existing solutions cannot handle inner branches directly, because the behavior of the GPU barrier function is undefined when it is executed inside branch statements; race conditions must therefore be prevented without the barrier function. Targeting nested divergence, we propose NeX, a nested extension scheme featuring an inter-thread protocol that supports sub-workgroup synchronization. We further exploit the on-the-fly nature of the Head-or-Tail (HoT) algorithm and propose HoT2, which adds flexibility to wavefront scheduling. Evaluated on four GPU models, including NVIDIA Volta and Turing, HoT2 proves more efficient. For benchmarks with branch nests up to five layers deep, NeX further boosts performance by up to 1.56x.
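To make the core idea concrete, the following is a minimal host-side simulation of thread-data remapping for a single branch. It is an illustrative sketch, not the paper's algorithm: the wavefront width `W`, the `remap` helper, and the toy data are all assumptions made for the example. The point is that reordering the thread-to-data mapping so that items with the same branch outcome land in the same wavefront eliminates divergent wavefronts.

```python
# Host-side simulation of thread-data remapping (TDR) for one branch.
# Data items are regrouped so that each wavefront of W threads receives
# items with a uniform branch outcome. All names and sizes here are
# illustrative, not taken from the paper.

W = 4  # toy wavefront (warp) width; real GPUs use 32 or 64

def divergent_wavefronts(outcomes, w=W):
    """Count wavefronts whose threads disagree on the branch outcome."""
    return sum(
        len(set(outcomes[i:i + w])) > 1
        for i in range(0, len(outcomes), w)
    )

def remap(data, predicate):
    """Reorder data so taken and not-taken items are contiguous."""
    taken = [x for x in data if predicate(x)]
    not_taken = [x for x in data if not predicate(x)]
    return taken + not_taken

data = [7, 2, 9, 4, 1, 8, 3, 6]      # alternating odd/even items
pred = lambda x: x % 2 == 1          # the branch condition

before = [pred(x) for x in data]
after = [pred(x) for x in remap(data, pred)]

print(divergent_wavefronts(before))  # 2: every wavefront diverges
print(divergent_wavefronts(after))   # 0: none diverge after remapping
```

On a real GPU the remapping is done in shared memory within a workgroup; nesting the scheme inside inner branches is exactly where the barrier-function restriction described above bites, motivating the barrier-free synchronization protocol.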

Highlights

Inter-thread synchronization can be efficient without the GPU barrier function.
Software-managed synchronization allows flexible wavefront scheduling.
With recursion, on-GPU thread-data remapping reduces divergence more completely.
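The first highlight, synchronizing a subset of threads without a hardware barrier, can be sketched with a counter-based rendezvous. This is a hypothetical CPU analogue using Python threads, not the paper's inter-thread protocol: a real GPU version would use atomics on shared memory plus memory fences, and the `SubgroupSync` class and its fields are invented for this sketch.

```python
import threading

class SubgroupSync:
    """Spin-on-counter rendezvous for a fixed-size subset of threads.
    Loosely analogous to sub-workgroup synchronization without a
    hardware barrier; illustrative only."""

    def __init__(self, size):
        self.size = size
        self.arrived = 0
        self.generation = 0
        self.lock = threading.Lock()

    def wait(self):
        with self.lock:
            gen = self.generation
            self.arrived += 1
            if self.arrived == self.size:  # last arrival releases the rest
                self.arrived = 0
                self.generation += 1
        while True:                        # spin until the generation flips
            with self.lock:
                if self.generation != gen:
                    return

log, log_lock = [], threading.Lock()

def worker(sync, name):
    with log_lock:
        log.append((name, "before"))
    sync.wait()                            # all subgroup members rendezvous here
    with log_lock:
        log.append((name, "after"))

sync = SubgroupSync(4)
threads = [threading.Thread(target=worker, args=(sync, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every "before" entry precedes every "after" entry.
print(all(phase == "before" for _, phase in log[:4]))  # True
```

Because only the participating subset spins on the counter, such a scheme can be applied inside a branch that only some threads of the workgroup enter, which is precisely where a workgroup-wide barrier would be undefined.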


Cited By

  • (2023) Divergence Reduction in Monte Carlo Neutron Transport with On-GPU Asynchronous Scheduling, ACM Transactions on Modeling and Computer Simulation, 34(1), 1–25. Online publication date: 19-Oct-2023. DOI: 10.1145/3626957
  • (2023) Optimization Techniques for GPU Programming, ACM Computing Surveys, 55(11), 1–81. Online publication date: 16-Mar-2023. DOI: 10.1145/3570638

        Published In

        Journal of Parallel and Distributed Computing  Volume 139, Issue C
        May 2020
        162 pages

        Publisher

        Academic Press, Inc.

        United States


        Author Tags

        1. GPGPU
        2. Branch divergence
        3. SIMD
        4. Race condition

        Qualifiers

        • Research-article
