research-article

Characterizing and enhancing global memory data coalescing on GPUs

Authors:

Louis-Noël Pouchet, P. Sadayappan

CGO '15: Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization

Pages 12 - 22

Published: 07 February 2015

Abstract

Effective parallel programming for GPUs requires careful attention to several factors, including ensuring coalesced access to data in global memory. Users need tools that identify the statements in a GPU kernel where non-coalesced data accesses occur and that help fix the problem. In this paper, we address both of these needs. We develop a two-stage framework in which dynamic analysis is first used to detect and characterize uncoalesced accesses in arbitrary PTX programs. Transformations that optimize global memory access by introducing coalesced access are then implemented, using either feedback from the dynamic analysis or a model-driven approach. Experimental results demonstrate the use of the tools on a number of benchmarks from the Rodinia and Polybench suites.
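
To make the notion of an uncoalesced access concrete, the following is a minimal sketch of how a dynamic analysis might count the memory transactions generated by one warp-wide load. It is not the framework described in the paper. The warp size of 32, the 128-byte segment size, and the helper names `warp_addresses` and `count_transactions` are all assumptions chosen for illustration; real hardware behavior depends on the GPU generation and cache configuration.

```python
from typing import Iterable, List

WARP_SIZE = 32        # assumed number of threads per warp
SEGMENT_BYTES = 128   # assumed size of one aligned memory transaction


def warp_addresses(base: int, stride_elems: int, elem_bytes: int = 4) -> List[int]:
    """Byte addresses touched by the lanes of one warp (hypothetical helper).

    Lane i accesses element base + i * stride_elems of an array whose
    elements are elem_bytes wide.
    """
    return [base + lane * stride_elems * elem_bytes for lane in range(WARP_SIZE)]


def count_transactions(addresses: Iterable[int],
                       segment_bytes: int = SEGMENT_BYTES) -> int:
    """Count the distinct aligned segments touched by one warp-wide access."""
    return len({addr // segment_bytes for addr in addresses})


# Consecutive 4-byte elements starting at an aligned address: one segment.
coalesced = count_transactions(warp_addresses(base=0, stride_elems=1))

# Same pattern shifted by 64 bytes: the warp straddles two segments.
misaligned = count_transactions(warp_addresses(base=64, stride_elems=1))

# Stride of 32 elements (e.g. walking a column of a row-major 32-wide
# float matrix): every lane lands in its own segment.
strided = count_transactions(warp_addresses(base=0, stride_elems=32))

print(coalesced, misaligned, strided)
```

Under these assumptions, a statement whose warps repeatedly touch many segments per access is a candidate for a layout or loop transformation, whereas a count of one indicates a fully coalesced access.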

Index Terms

Characterizing and enhancing global memory data coalescing on GPUs
1. Computing methodologies
  1. Computer graphics
    1. Graphics systems and interfaces
      1. Graphics processors
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Source code generation

Published In

CGO '15: Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization

February 2015

280 pages

ISBN:9781479981618

General Chairs: Kunle Olukotun (Stanford University), Aaron Smith (Microsoft Research)

Program Chairs: Robert Hundt (Google), Jason Mars (University of Michigan)

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages
ACM: Association for Computing Machinery
IEEE Computer Society TC-uARCH
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
IEEE-CS\DATC: IEEE Computer Society

Publisher

IEEE Computer Society

United States

Conference

CGO '15

February 7 - 11, 2015

San Francisco, California

Acceptance Rates

CGO '15 Paper Acceptance Rate 24 of 88 submissions, 27%;

Overall Acceptance Rate 312 of 1,061 submissions, 29%
