
Characterizing and enhancing global memory data coalescing on GPUs

Published: 07 February 2015

Abstract

Effective parallel programming for GPUs requires careful attention to several factors, including ensuring coalesced access of data from global memory. There is a need for tools that can provide feedback to users about statements in a GPU kernel where non-coalesced data access occurs, and assistance in fixing the problem. In this paper, we address both these needs. We develop a two-stage framework where dynamic analysis is first used to detect and characterize uncoalesced accesses in arbitrary PTX programs. Transformations to optimize global memory access by introducing coalesced access are then implemented, using feedback from the dynamic analysis or using a model-driven approach. Experimental results demonstrate the use of the tools on a number of benchmarks from the Rodinia and Polybench suites.
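To make the notion of an uncoalesced access concrete, here is a minimal sketch of how a dynamic check could classify one warp-wide load from its recorded byte addresses. It is not the paper's framework. The warp size of 32, the 128-byte transaction segment, and the helper names `transactions`, `is_coalesced`, and `warp_addresses` are all illustrative assumptions.

```python
from math import ceil

WARP_SIZE = 32        # threads per warp (assumption: typical NVIDIA GPUs)
SEGMENT_BYTES = 128   # memory transaction granularity (assumption)


def transactions(addresses, segment_bytes=SEGMENT_BYTES):
    """Count the distinct aligned segments touched by one warp-wide access."""
    return len({addr // segment_bytes for addr in addresses})


def is_coalesced(addresses, elem_bytes, segment_bytes=SEGMENT_BYTES):
    """True if the warp touches no more segments than a dense, aligned access needs."""
    minimum = ceil(len(addresses) * elem_bytes / segment_bytes)
    return transactions(addresses, segment_bytes) <= minimum


def warp_addresses(base, stride_elems, elem_bytes=4, warp_size=WARP_SIZE):
    """Byte addresses of a[tid * stride_elems] for every thread id in one warp."""
    return [base + tid * stride_elems * elem_bytes for tid in range(warp_size)]


# Hypothetical traces: unit stride (a[tid]) versus a large stride (a[tid * 1024]).
unit_stride = warp_addresses(base=0, stride_elems=1)
large_stride = warp_addresses(base=0, stride_elems=1024)
misaligned = warp_addresses(base=64, stride_elems=1)
```

Under these assumptions, the unit-stride trace fits in a single segment, the large-stride trace touches one segment per thread, and the misaligned unit-stride trace straddles two segments. A transformation stage would aim to turn the second pattern into the first, for example by changing the data layout or the thread-to-data mapping.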




Published In

CGO '15: Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization
February 2015
280 pages
ISBN:9781479981618

Publisher

IEEE Computer Society

United States

Author Tags

  1. GPU
  2. PTX
  3. coalescing
  4. dynamic analysis
  5. locality
  6. polyhedral compilation
  7. program transformation

Qualifiers

  • Research-article

Conference

CGO '15
Acceptance Rates

CGO '15 Paper Acceptance Rate 24 of 88 submissions, 27%;
Overall Acceptance Rate 312 of 1,061 submissions, 29%

Cited By

  • (2020) Design of Rapid Image Mosaic Based on CUDA by 100-Megapixel Optical System. Proceedings of the 4th International Conference on Computer Science and Application Engineering, pages 1-6. DOI: 10.1145/3424978.3425105. Online publication date: 20-Oct-2020
  • (2019) A static analytical performance model for GPU kernel. International Journal of Computational Science and Engineering, 18:2, pages 201-210. DOI: 10.5555/3319216.3319226. Online publication date: 1-Jan-2019
  • (2019) MAC. Proceedings of the 48th International Conference on Parallel Processing, pages 1-10. DOI: 10.1145/3337821.3337867. Online publication date: 5-Aug-2019
  • (2019) A case study on machine learning for synthesizing benchmarks. Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 38-46. DOI: 10.1145/3315508.3329976. Online publication date: 22-Jun-2019
  • (2019) A Survey on Agent-based Simulation Using Hardware Accelerators. ACM Computing Surveys, 51:6, pages 1-35. DOI: 10.1145/3291048. Online publication date: 28-Jan-2019
  • (2018) Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs. Proceedings of the 2018 International Conference on Supercomputing, pages 86-95. DOI: 10.1145/3205289.3205298. Online publication date: 12-Jun-2018
  • (2018) Bulk-synchronous parallel simultaneous BVH traversal for collision detection on GPUs. Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, pages 1-9. DOI: 10.1145/3190834.3190848. Online publication date: 15-May-2018
  • (2017) Optimized two-level parallelization for GPU accelerators using the polyhedral model. Proceedings of the 26th International Conference on Compiler Construction, pages 22-33. DOI: 10.1145/3033019.3033022. Online publication date: 5-Feb-2017
  • (2016) Concurrent Dynamic Memory Coalescing on GoblinCore-64 Architecture. Proceedings of the Second International Symposium on Memory Systems, pages 177-187. DOI: 10.1145/2989081.2989128. Online publication date: 3-Oct-2016
  • (2016) Energy and Performance Prediction of CUDA Applications using Dynamic Regression Models. Proceedings of the 9th India Software Engineering Conference, pages 37-47. DOI: 10.1145/2856636.2856643. Online publication date: 18-Feb-2016
