Abstract
GPU programming has become popular due to the high computational capabilities of GPUs. Obtaining significant performance gains with GPUs is, however, challenging, and the programmer needs to be aware of various subtleties of the GPU architecture. One such subtlety lies in accessing GPU memory, where certain access patterns can lead to poor performance. Such access patterns are referred to as uncoalesced global memory accesses. This work presents a lightweight compile-time static analysis to identify such accesses in GPU programs. The analysis relies on a novel abstraction that tracks the access pattern across multiple threads. The abstraction enables quick prediction while providing correctness guarantees. We have implemented the analysis in LLVM and compared it against a dynamic analysis implementation. The static analysis identifies 95 pre-existing uncoalesced accesses in Rodinia, a popular benchmark suite of GPU programs, and finishes within seconds for most programs, whereas the dynamic analysis finds 69 accesses and takes orders of magnitude longer to finish.
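To make the notion of an uncoalesced access concrete, the following is a minimal illustrative model and not the analysis described in this work. It counts how many aligned memory segments a single warp touches for a given thread-to-index mapping. The warp size of 32, the 128-byte segment size, the 4-byte element size, and the helper name `segments_touched` are all assumptions chosen for illustration; actual transaction sizes depend on the GPU architecture.

```python
# Illustrative model only: counts the aligned memory segments one warp touches.
# All constants below are assumptions for this sketch, not values from the paper.
WARP_SIZE = 32        # threads per warp (assumed)
SEGMENT_BYTES = 128   # assumed size of one aligned memory segment
ELEM_BYTES = 4        # element size, e.g. a 32-bit float


def segments_touched(index_of_thread, warp_size=WARP_SIZE):
    """Return the number of distinct aligned segments accessed by one warp.

    index_of_thread maps a thread id to the array index that thread accesses.
    """
    segments = set()
    for tid in range(warp_size):
        byte_address = index_of_thread(tid) * ELEM_BYTES
        segments.add(byte_address // SEGMENT_BYTES)
    return len(segments)


# Adjacent threads read adjacent elements, as in a[tid]:
# all 32 accesses fall into a single segment.
coalesced = segments_touched(lambda tid: tid)

# Adjacent threads read elements 32 apart, as in a[tid * 32]
# (for example, walking down a column of a 32-wide row-major matrix):
# every thread lands in a different segment.
strided = segments_touched(lambda tid: tid * 32)

print(coalesced, strided)
```

Under these assumptions the first pattern needs one segment for the whole warp, while the strided pattern needs one segment per thread. This difference in access pattern across threads is the kind of property that the abstract describes the analysis as tracking.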
Cite this article
Alur, R., Devietti, J., Leija, O.S.N. et al. Static detection of uncoalesced accesses in GPU programs. Form Methods Syst Des 60, 1–32 (2022). https://doi.org/10.1007/s10703-021-00362-8
DOI: https://doi.org/10.1007/s10703-021-00362-8