TOAST: Automatic tiling for iterative stencil computations on GPUs

Authors:

Rodrigo C. O. Rocha,

Alyson D. Pereira,

Luís F. W. Góes

Concurrency and Computation: Practice & Experience, Volume 29, Issue 8

Page n/a

https://doi.org/10.1002/cpe.4053

Published: 25 April 2017

Abstract

The stencil pattern is important in many scientific and engineering domains, spurring great interest from researchers and industry. In recent years, various optimizations have been proposed for parallel stencil applications running on graphics processing units (GPUs). In particular, tiling is a technique that can significantly enhance application performance by improving data locality and by reducing the volume of communication between host memory and the GPU. In addition, tiling enables stencil applications to process inputs that are larger than the physical GPU memory. However, implementing tiling efficiently is complex, time-consuming, and error-prone. In this paper, we propose transparently optimized automatic stencil tiling (TOAST), an automatic tiling mechanism for iterative stencil computations running on GPUs. TOAST has 3 main benefits: (1) it incorporates an optimization model that seeks to maximize data reuse within tiles while respecting the amount of dynamically available GPU memory; (2) it offers a virtualized GPU memory for stencil computations, allowing for large input data; and (3) it performs optimal tiling transparently to the developer of the parallel stencil application. The current implementation of TOAST augments the PSkel framework with an internal solver based on genetic algorithms. Our experimental results show that TOAST improves the performance of iterative stencil applications by up to 13× compared with their optimized multithreaded CPU versions and by up to 48× compared with a naive tiling approach on the GPU. The TOAST mechanism automatically achieves a low data management overhead, measured as a percentage of the actual stencil computation.
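To make the tiling idea in the abstract concrete, the following is a minimal sketch of overlapped (ghost-zone) tiling for an iterative 1D stencil. It is not TOAST's optimization model or its genetic-algorithm solver, and it does not use the PSkel API. The names `stencil_step`, `run_untiled`, `run_tiled`, and the `budget_cells` parameter (a stand-in for the free GPU memory, counted in cells) are assumptions made for illustration only.

```python
RADIUS = 1  # the 3-point stencil below reads one neighbour on each side


def stencil_step(cells):
    """Apply one iteration of a 3-point stencil; cells outside the list count as 0."""
    n = len(cells)
    return [
        (cells[i - 1] if i > 0 else 0)
        + 2 * cells[i]
        + (cells[i + 1] if i < n - 1 else 0)
        for i in range(n)
    ]


def run_untiled(cells, iterations):
    """Reference result: iterate over the whole grid at once."""
    for _ in range(iterations):
        cells = stencil_step(cells)
    return cells


def run_tiled(cells, iterations, budget_cells):
    """Process the grid tile by tile, each tile carrying a ghost zone (halo).

    budget_cells is a hypothetical memory budget: the maximum number of
    cells that one tile plus its two halos may occupy.
    """
    halo = iterations * RADIUS            # stale data creeps in RADIUS cells per iteration
    tile_width = budget_cells - 2 * halo  # what remains for useful output
    if tile_width < 1:
        raise ValueError("budget too small for this many iterations per tile")

    n = len(cells)
    result = []
    for lo in range(0, n, tile_width):
        hi = min(lo + tile_width, n)
        ext_lo = max(0, lo - halo)        # clamp the halo at the real grid boundary
        ext_hi = min(n, hi + halo)
        chunk = cells[ext_lo:ext_hi]      # the only data copied for this tile
        for _ in range(iterations):       # all iterations reuse the copied chunk
            chunk = stencil_step(chunk)
        result.extend(chunk[lo - ext_lo:hi - ext_lo])  # discard the halo cells
    return result


grid = [(7 * i) % 11 for i in range(40)]
assert run_tiled(grid, iterations=3, budget_cells=16) == run_untiled(grid, 3)
```

The sketch shows the trade-off an automatic tiler has to balance: running more iterations per tile increases data reuse for each transfer, but the halo grows with the iteration count and leaves less of the memory budget for useful output.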

Information & Contributors

Information

Published In

Concurrency and Computation: Practice & Experience Volume 29, Issue 8

April 2017

ISSN: 1532-0626

Publisher

John Wiley and Sons Ltd.

United Kingdom

Cited By

Afonso S, Acosta A, Almeida F (2019). High-performance code optimizations for mobile devices. The Journal of Supercomputing 75:3 (1382-1395). doi:10.1007/s11227-018-2638-5. Online publication date: 1-Mar-2019.
https://dl.acm.org/doi/10.1007/s11227-018-2638-5

Seyfari Y, Lotfi S, Karimpour J (2018). Optimizing inter-nest data locality in imperfect stencils based on loop blocking. The Journal of Supercomputing 74:10 (5432-5460). doi:10.1007/s11227-018-2443-1. Online publication date: 1-Oct-2018.
https://dl.acm.org/doi/10.1007/s11227-018-2443-1
https://dl.acm.org/doi/10.5555/3288339.3288365

Podestá E, do Nascimento B, Castro M (2018). Energy Efficient Stencil Computations on the Low-Power Manycore MPPA-256 Processor. Euro-Par 2018: Parallel Processing (642-655). doi:10.1007/978-3-319-96983-1_46. Online publication date: 27-Aug-2018.
https://dl.acm.org/doi/10.1007/978-3-319-96983-1_46

Yuan L, Zhang Y, Guo P, Huang S, Mohr B, Raghavan P (2017). Tessellating stencils. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-13). doi:10.1145/3126908.3126920. Online publication date: 12-Nov-2017.
https://dl.acm.org/doi/10.1145/3126908.3126920
