DOI: 10.5555/3314872.3314884
Article

Automatic generation of warp-level primitives and atomic instructions for fast and portable parallel reduction on GPUs

Published: 16 February 2019

Abstract

Since the advent of GPU computing, GPU hardware has evolved at a fast pace. Because application performance heavily depends on the latest hardware improvements, performance portability is extremely challenging for GPU application library developers. Portability becomes even more difficult when new low-level instructions are added to the ISA (e.g., warp shuffle instructions) or the microarchitectural support for existing instructions is improved (e.g., atomic instructions). Library developers, besides re-tuning the code for new hardware features, deal with the performance portability issue by hand-writing multiple algorithm versions that leverage different instruction sets and microarchitectures. High-level programming frameworks and Domain Specific Languages (DSLs) do not typically support low-level instructions (e.g., warp shuffle and atomic instructions), so it is difficult or even impossible for these programming systems to take advantage of the latest architectural improvements. In this work, we design a new set of high-level APIs and qualifiers, as well as specialized Abstract Syntax Tree (AST) transformations for high-level programming languages and DSLs. Our transformations enable warp shuffle instructions and atomic instructions (on global and shared memory) to be easily generated. We show a practical implementation of these transformations by building on Tangram, a high-level kernel synthesis framework. Using our new language and compiler extensions, we implement parallel reduction, a fundamental building block used in a wide range of algorithms. Parallel reduction is representative of the performance portability challenge, as its performance heavily depends on the latest hardware improvements. We compare our synthesized parallel reduction against another high-level programming framework and a hand-written high-performance library across three generations of GPU architectures, and show up to 7.8x speedup (2x on average) over hand-written code.
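
To make the abstract's terminology concrete, the sketch below is a minimal hand-written CUDA reduction that combines the two instruction classes the paper targets: warp shuffle instructions for the intra-warp step and a single atomic instruction for the inter-block step. This is an illustrative version under common assumptions (CUDA 9 or later for __shfl_down_sync, a block size that is a multiple of the warp size), not the code the paper's framework generates; the names warpReduceSum and reduceSum are hypothetical.

    #include <cuda_runtime.h>

    // Intra-warp tree reduction via shuffle-down: after the loop,
    // lane 0 of each warp holds the sum of all 32 lanes' values.
    __inline__ __device__ int warpReduceSum(int val) {
        for (int offset = warpSize / 2; offset > 0; offset /= 2)
            val += __shfl_down_sync(0xffffffffu, val, offset);
        return val;
    }

    // *out must be zero-initialized by the host before launch.
    __global__ void reduceSum(const int *in, int *out, int n) {
        // Grid-stride loop: each thread builds a private partial sum.
        int sum = 0;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += blockDim.x * gridDim.x)
            sum += in[i];

        sum = warpReduceSum(sum);          // warp shuffle instructions
        __shared__ int warpSums[32];       // one slot per warp in the block
        int lane = threadIdx.x % warpSize;
        int warp = threadIdx.x / warpSize;
        if (lane == 0) warpSums[warp] = sum;
        __syncthreads();

        // The first warp reduces the per-warp partials; one thread per
        // block then publishes the block total with a global atomic.
        if (warp == 0) {
            int nWarps = blockDim.x / warpSize;  // assumes multiple of 32
            sum = (lane < nWarps) ? warpSums[lane] : 0;
            sum = warpReduceSum(sum);
            if (lane == 0) atomicAdd(out, sum);  // atomic instruction
        }
    }

The portability problem the paper addresses is that variants of this kernel (shared-memory-only trees on GPUs without shuffle, shuffle-based reductions on Kepler and later, heavier use of atomics where their microarchitectural support is fast) would otherwise each be hand-written and hand-tuned; the proposed APIs, qualifiers, and AST transformations let a single high-level reduction specification synthesize the appropriate variant per architecture.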

Cited By

  • (2023) "Enabling Quantum Computer Simulations on AMD GPUs: a HIP Backend for Google's qsim," Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 1478-1486. https://doi.org/10.1145/3624062.3624223. Online publication date: 12-Nov-2023.
  • (2020) "RICH," Proceedings of the 34th ACM International Conference on Supercomputing, pp. 1-13. https://doi.org/10.1145/3392717.3392736. Online publication date: 29-Jun-2020.

Published In

CGO 2019: Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization
February 2019
286 pages
ISBN: 9781728114361

Publisher

IEEE Press


Author Tags

  1. Code Generation
  2. Code Transformation
  3. DSL
  4. GPU
  5. Heterogeneity
  6. Parallelism
  7. Performance Portability

Qualifiers

  • Article

Acceptance Rates

Overall Acceptance Rate 312 of 1,061 submissions, 29%

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 11
  • Downloads (last 6 weeks): 0
Reflects downloads up to 26 Sep 2024
