research-article

Split tiling for GPUs: automatic parallelization using trapezoidal tiles

Authors:

Tobias Grosser,

Paul H. J. Kelly,

Sven VerdoolaegeAuthors Info & Claims

GPGPU-6: Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units

Pages 24 - 31

https://doi.org/10.1145/2458523.2458526

Published: 16 March 2013 Publication History

Abstract

Tiling is a key technique to enhance data reuse. For computations structured as one sequential outer "time" loop enclosing a set of parallel inner loops, tiling only the parallel inner loops may not enable enough data reuse in the cache. Tiling the inner loops along with the outer time loop enhances data locality but may require other transformations like loop skewing that inhibit inter-tile parallelism.

One approach to tiling that enhances data locality without inhibiting inter-tile parallelism is split tiling, where tiles are subdivided into a sequence of trapezoidal computation steps. In this paper, we develop an approach to generate split tiled code for GPUs in the PPCG polyhedral code generator. We propose a generic algorithm to calculate index-set splitting that enables us to perform tiling for locality and synchronization avoidance, while simultaneously maintaining parallelism, without the need for skewing or redundant computations. Our algorithm performs split tiling for an arbitrary number of dimensions and without the need to construct any large integer linear program. The method and its implementation are evaluated on standard stencil kernels and compared with a state-of-the-art polyhedral compiler and with a domain-specific stencil compiler, both targeting CUDA GPUs.

References

[1]

M. Amini, F. Coelho, F. Irigoin, and R. Keryell. Static compilation analysis for host-accelerator communication optimization. In Workshop on Languages and Compilers for Parallel Computing (LCPC'11), LNCS. Springer-Verlag, Oct. 2011.

[2]

M. Amini, B. Creusillet, S. Even, R. Keryell, O. Goubier, S. Guelton, J. O. McMahon, F. X. Pasquier, G. Péan, and P. Villalon. Par4all: From convex array regions to heterogeneous computing. In IMPACT'12, Paris, France, Jan. 2012.

[3]

V. Bandishti, I. Pananilath, and U. Bondhugula. Tiling stencil computations to maximize parallelism. In Proceedings of SC '12, pages 40:1--40:11, Los Alamitos, CA, USA, 2012. IEEE.

Digital Library

[4]

M. M. Baskaran, J. Ramanujam, and P. Sadayappan. Automatic c-to-cuda code generation for affine programs. In Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction, CC'10/ETAPS'10, pages 244--263, Berlin, Heidelberg, 2010. Springer-Verlag.

Digital Library

[5]

U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation, PLDI'08, pages 101--113, New York, NY, USA, 2008. ACM.

Digital Library

[6]

M. Christen, O. Schenk, and H. Burkhart. Patus: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. In Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium IPDPS '11, pages 676--687, Washington, DC, USA, 2011. IEEE.

Digital Library

[7]

K. Datta, S. Kamil, S. Williams, L. Oliker, J. Shalf, and K. A. Yelick. Optimization and performance modeling of stencil computations on modern microprocessors. SIAM Review, 51(1):129--159, 2009.

Digital Library

[8]

K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In Proceedings of SC '08, pages 4:1--4:12, Piscataway, NJ, USA, 2008. IEEE Press.

Digital Library

[9]

P. Di and J. Xue. Model-driven tile size selection for DOACROSS loops on GPUs. In Proceedings of the 17th international conference on Parallel processing - Volume Part II, Euro-Par'11, pages 401--412, Berlin, Heidelberg, 2011. Springer-Verlag.

Digital Library

[10]

D. Han, S. Xu, L. Chen, and L. Huang. Pads: A pattern-driven stencil compiler-based tool for reuse of optimizations on gpgpus. In ICPADS, pages 308--315, 2011.

Digital Library

[11]

J. Holewinski, L.-N. Pouchet, and P. Sadayappan. High-performance code generation for stencil computations on gpu architectures. In Proceedings of the 26th ACM international conference on Supercomputing, ICS '12, pages 311--320, New York, NY, USA, 2012. ACM.

Digital Library

[12]

S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation, PLDI'07, pages 235--244, New York, NY, USA, 2007. ACM.

Digital Library

[13]

J. Meng and K. Skadron. A performance study for iterative stencil loops on gpus with ghost zone optimizations. International Journal of Parallel Programming, 39(1):115--142, 2011.

[14]

P. Micikevicius. 3d finite difference computation on gpus using cuda. In Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-2, pages 79--84, New York, NY, USA, 2009. ACM.

Digital Library

[15]

A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey. 3.5-d blocking optimization for stencil computations on modern cpus and gpus. In Proceedings of SC '10, pages 1--13, Washington, DC, USA, 2010. IEEE Computer Society.

Digital Library

[16]

The OpenACC standard, 2011.

[17]

M. M. Strout, L. Carter, J. Ferrante, J. Freeman, and B. Kreaseck. Combining performance aspects of irregular gauss-seidel via sparse tiling. In LCPC, pages 90--110, 2002.

Digital Library

[18]

R. Strzodka, M. Shaheen, D. Pajak, and H.-P. Seidel. Cache oblivious parallelograms in iterative stencil computations. In ICS, pages 49--59, 2010.

Digital Library

[19]

R. Strzodka, M. Shaheen, D. Pajak, and H.-P. Seidel. Cache accurate time skewing in iterative stencil computations. 2012 41st International Conference on Parallel Processing, 0:571--581, 2011.

Digital Library

[20]

Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson. The pochoir stencil compiler. In Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures, SPAA '11, pages 117--128, New York, NY, USA, 2011. ACM.

Digital Library

[21]

S. Verdoolaege. Counting affine calculator and applications. In First International Workshop on Polyhedral Compilation Techniques (IMPACT'11), Charmonix, France, Apr. 2011.

[22]

S. Verdoolaege and T. Grosser. Polyhedral extraction tool. In Second International Workshop on Polyhedral Compilation Techniques (IMPACT'12), Paris, France, Jan. 2012.

[23]

S. Verdoolaege, J. C. Juega, A. Cohen, J. I. Gómez, C. Tenllado, and F. Catthoor. Polyhedral parallel code generation for CUDA. ACM Transactions on Architecture and Code Optimization(TACO), Dec. 2012. Selected for presentation at the HiPEAC 2013 Conf.

Digital Library

[24]

X. Zhou, J.-P. Giacalone, M. J. Garzarán, R. H. Kuhn, Y. Ni, and D. Padua. Hierarchical overlapped tiling. In Proceedings of the 10th Intl. Symp. Code Gen. and Opt., CGO '12, pages 207--218, New York, NY, USA, 2012. ACM.

Digital Library

Cited By

Dolp RHanika JDachsbacher C(2024)A Fast GPU Schedule For À-Trous Wavelet-Based DenoisersProceedings of the ACM on Computer Graphics and Interactive Techniques10.1145/36512997:1(1-18)Online publication date: 13-May-2024
https://dl.acm.org/doi/10.1145/3651299
Chen YLi KWang YBai DWang LMa LYuan LZhang YCao TYang MLee IChabbi MSteuwer M(2024)ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor CoresProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638476(333-347)Online publication date: 2-Mar-2024
https://dl.acm.org/doi/10.1145/3627535.3638476
Zhang YLi KYuan LCheng JZhang YCao TYang M(2024)LoRAStencil: Low-Rank Adaptation of Stencil Computation on Tensor CoresProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00059(1-17)Online publication date: 17-Nov-2024
https://dl.acm.org/doi/10.1109/SC41406.2024.00059
Show More Cited By

Index Terms

Split tiling for GPUs: automatic parallelization using trapezoidal tiles
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

Polyhedral parallel code generation for CUDA
Special Issue on High-Performance Embedded Architectures and Compilers

This article addresses the compilation of a sequential program for parallel execution on a modern GPU. To this end, we present a novel source-to-source compiler called PPCG. PPCG singles out for its ability to accelerate computations from any static ...
Hybrid Hexagonal/Classical Tiling for GPUs
CGO '14: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization

Time-tiling is necessary for the efficient execution of iterative stencil computations. Classical hyper-rectangular tiles cannot be used due to the combination of backward and forward dependences along space dimensions. Existing techniques trade ...
Hybrid Hexagonal/Classical Tiling for GPUs
CGO '14: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization

Time-tiling is necessary for the efficient execution of iterative stencil computations. Classical hyper-rectangular tiles cannot be used due to the combination of backward and forward dependences along space dimensions. Existing techniques trade ...

Comments

Information & Contributors

Information

Published In

cover image ACM Other conferences

GPGPU-6: Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units

March 2013

156 pages

ISBN:9781450320177

DOI:10.1145/2458523

Editors:
John Cavazos
University of Delaware
,
Xiang Gong,
David Kaeli

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 March 2013

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

Conference

GPGPU-6

GPGPU-6: Sixth Workshop on General Purpose Processing Using GPUs

March 16, 2013

Texas, Houston, USA

Acceptance Rates

GPGPU-6 Paper Acceptance Rate 15 of 37 submissions, 41%;

Overall Acceptance Rate 57 of 129 submissions, 44%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

73
Total Citations
View Citations
645
Total Downloads

Downloads (Last 12 months)32
Downloads (Last 6 weeks)3

Reflects downloads up to 06 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Dolp RHanika JDachsbacher C(2024)A Fast GPU Schedule For À-Trous Wavelet-Based DenoisersProceedings of the ACM on Computer Graphics and Interactive Techniques10.1145/36512997:1(1-18)Online publication date: 13-May-2024
https://dl.acm.org/doi/10.1145/3651299
Chen YLi KWang YBai DWang LMa LYuan LZhang YCao TYang MLee IChabbi MSteuwer M(2024)ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor CoresProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638476(333-347)Online publication date: 2-Mar-2024
https://dl.acm.org/doi/10.1145/3627535.3638476
Zhang YLi KYuan LCheng JZhang YCao TYang M(2024)LoRAStencil: Low-Rank Adaptation of Stencil Computation on Tensor CoresProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00059(1-17)Online publication date: 17-Nov-2024
https://dl.acm.org/doi/10.1109/SC41406.2024.00059
Shan BAraya-Polo M(2024)Evaluation of Programming Models and Performance for Stencil Computation on GPGPUs2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)10.1109/IPDPSW63119.2024.00198(1178-1180)Online publication date: 27-May-2024
https://doi.org/10.1109/IPDPSW63119.2024.00198
Wang XChen YLi BBu J(2024)Optimization of weight update for convolutional operators for domestic accelerators2024 5th International Conference on Computer Engineering and Application (ICCEA)10.1109/ICCEA62105.2024.10603451(503-507)Online publication date: 12-Apr-2024
https://doi.org/10.1109/ICCEA62105.2024.10603451
Shan BAraya-Polo MChapman B(2024)Evaluation of Directive-Based Programming Models for Stencil Computation on Current GPGPU ArchitecturesAdvancing OpenMP for Future Accelerators10.1007/978-3-031-72567-8_9(126-140)Online publication date: 23-Sep-2024
https://dl.acm.org/doi/10.1007/978-3-031-72567-8_9
Zhang LWahib MChen PMeng JWang XEndo TMatsuoka SGallivan KNikolopoulos DBeivide RGallopoulos E(2023)Revisiting Temporal Blocking Stencil OptimizationsProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593716(251-263)Online publication date: 21-Jun-2023
https://dl.acm.org/doi/10.1145/3577193.3593716
Zhang LWahib MChen PMeng JWang XEndo TMatsuoka SGallivan KNikolopoulos DBeivide RGallopoulos E(2023)PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU ApplicationsProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593705(167-179)Online publication date: 21-Jun-2023
https://dl.acm.org/doi/10.1145/3577193.3593705
Liu SZhang ZWu W(2023)DHTS: A Dynamic Hybrid Tiling Strategy for Optimizing Stencil Computation on GPUsIEEE Transactions on Computers10.1109/TC.2023.327106072:10(2795-2807)Online publication date: Oct-2023
https://doi.org/10.1109/TC.2023.3271060
Xu RSha EZhuge QSong YWang H(2023)Loop interchange and tiling for multi-dimensional loops to minimize write operations on NVMsJournal of Systems Architecture10.1016/j.sysarc.2022.102799135(102799)Online publication date: Feb-2023
https://doi.org/10.1016/j.sysarc.2022.102799
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten