
TOAST: Automatic tiling for iterative stencil computations on GPUs

Published: 25 April 2017

Abstract

The stencil pattern is important in many scientific and engineering domains, spurring great interest from researchers and industry. In recent years, various optimizations have been proposed for parallel stencil applications running on graphics processing units (GPUs). In particular, tiling is a technique that can significantly enhance application performance by improving data locality and by reducing the volume of communication between host memory and GPU. In addition, tiling enables stencil applications to process inputs that are larger than the physical GPU memory. However, implementing tiling efficiently is complex, time-consuming, and error-prone. In this paper, we propose transparently optimized automatic stencil tiling (TOAST), an automatic tiling mechanism for iterative stencil computations running on GPUs. TOAST has 3 main benefits: (1) it incorporates an optimization model that seeks to maximize data reuse within tiles while respecting the amount of dynamically available GPU memory; (2) it offers a virtualized GPU memory for stencil computations, allowing for large input data; and (3) it performs optimal tiling transparently to the developer of the parallel stencil application. The current implementation of TOAST augments the PSkel framework with an internal solver based on genetic algorithms. Our experimental results show that TOAST improves the performance of iterative stencil applications by up to 13× compared with their optimized multithreaded CPU-based versions and by up to 48× compared with a naive tiling approach on GPU. The TOAST mechanism automatically achieves a low percentage overhead of data management relative to the actual stencil computation.
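
The data-reuse idea behind tiling an iterative stencil can be illustrated with a small, self-contained sketch. This is not TOAST or PSkel code: it uses plain Python lists, a 1D three-point averaging stencil, and a list slice as a stand-in for the host-to-GPU copy. All names (`step`, `run_untiled`, `run_tiled`, `tile_size`, `inner_iters`) are hypothetical. Each tile is loaded together with a halo of `radius * inner_iters` cells, so it can be advanced `inner_iters` iterations before anything is transferred again. Here the tile size and inner iteration count are chosen by hand, whereas the abstract states that TOAST selects its tiling automatically under the available GPU memory.

```python
def step(cells):
    """One iteration of a 3-point averaging stencil.

    Neighbors that fall outside the buffer are treated as 0.
    """
    n = len(cells)
    out = []
    for i in range(n):
        left = cells[i - 1] if i > 0 else 0.0
        right = cells[i + 1] if i < n - 1 else 0.0
        out.append((left + cells[i] + right) / 3.0)
    return out


def run_untiled(grid, iterations):
    """Reference version: the whole grid is processed at once."""
    for _ in range(iterations):
        grid = step(grid)
    return grid


def run_tiled(grid, iterations, tile_size, inner_iters):
    """Overlapped (halo) tiling: each tile runs several iterations per load.

    Only tile_size + 2 * halo cells are resident at a time, which models a
    device memory smaller than the input.
    """
    n = len(grid)
    radius = 1  # reach of the 3-point stencil
    done = 0
    while done < iterations:
        t = min(inner_iters, iterations - done)
        halo = radius * t  # cells that become stale after t local iterations
        result = [0.0] * n
        for start in range(0, n, tile_size):
            end = min(start + tile_size, n)
            lo = max(0, start - halo)
            hi = min(n, end + halo)
            local = grid[lo:hi]  # stand-in for a host-to-device copy
            for _ in range(t):
                local = step(local)
            # Keep only the tile interior; the halo cells are discarded.
            result[start:end] = local[start - lo:end - lo]
        grid = result
        done += t
    return grid
```

Calling `run_tiled(grid, 10, 8, 3)` should match `run_untiled(grid, 10)` for the same input. A larger `inner_iters` means fewer transfers but a wider halo, and therefore more redundant computation and a larger resident buffer. That trade-off is the kind of choice an automatic tiling model has to make.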


Published In

Concurrency and Computation: Practice & Experience, Volume 29, Issue 8
April 2017

Publisher

John Wiley and Sons Ltd.

United Kingdom

Author Tags

  1. GPU
  2. autotuning
  3. optimization model
  4. parallel skeletons
  5. stencil computation
  6. tiling

Cited By

  • (2019) High-performance code optimizations for mobile devices. The Journal of Supercomputing 75:3 (1382-1395). DOI: 10.1007/s11227-018-2638-5. Online publication date: 1-Mar-2019
  • (2018) Optimizing inter-nest data locality in imperfect stencils based on loop blocking. The Journal of Supercomputing 74:10 (5432-5460). DOI: 10.5555/3288339.3288365. Online publication date: 1-Oct-2018
  • (2018) Optimizing inter-nest data locality in imperfect stencils based on loop blocking. The Journal of Supercomputing 74:10 (5432-5460). DOI: 10.1007/s11227-018-2443-1. Online publication date: 1-Oct-2018
  • (2018) Energy Efficient Stencil Computations on the Low-Power Manycore MPPA-256 Processor. Euro-Par 2018: Parallel Processing (642-655). DOI: 10.1007/978-3-319-96983-1_46. Online publication date: 27-Aug-2018
  • (2017) Tessellating stencils. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-13). DOI: 10.1145/3126908.3126920. Online publication date: 12-Nov-2017
