research-article

A framework for enhancing data reuse via associative reordering

Authors: Tobias Grosser, Louis-Noël Pouchet, Fabrice Rastello, P. Sadayappan

ACM SIGPLAN Notices, Volume 49, Issue 6

Pages 65 - 76

https://doi.org/10.1145/2666356.2594342

Published: 09 June 2014

Abstract

The freedom to reorder computations involving associative operators has been widely recognized and exploited in designing parallel algorithms and to a more limited extent in optimizing compilers.

In this paper, we develop a novel framework that uses the associativity and commutativity of operations in regular loop computations to enhance register reuse. Stencils are an important class of computations to which the framework can be applied to improve performance. We show how stencil operations can be implemented to better exploit register reuse and reduce the number of loads and stores. We develop a multi-dimensional retiming formalism to characterize the space of valid implementations in conjunction with other program transformations. Experimental results demonstrate the effectiveness of the framework on a collection of high-order stencils.
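As a rough illustration of the kind of reordering the abstract describes, the sketch below computes a small 2D stencil in two ways. It is not the paper's framework or its retiming formalism. The function names, the 3x3 weights `w`, and the input `inp` are hypothetical, chosen only for this example. The second version visits each input row once and adds its row-wise partial sums to every output row that needs them, so each output is accumulated as `out + (a + b + c)` rather than `((out + a) + b) + c`. That regrouping is legal only because `+` is associative. Integer data is used so the two results match exactly; with floating-point values a different grouping can change rounding.

```python
def stencil_gather(inp, w):
    """Reference form: out[i][j] = sum over (di, dj) of w[di][dj] * inp[i+di][j+dj]."""
    r = len(w)  # assumption: square r x r weight table
    rows, cols = len(inp) - r + 1, len(inp[0]) - r + 1
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0
            for di in range(r):
                for dj in range(r):
                    acc += w[di][dj] * inp[i + di][j + dj]
            out[i][j] = acc
    return out


def stencil_reordered(inp, w):
    """Reordered form: iterate over input rows and push partial sums to outputs."""
    r = len(w)
    rows, cols = len(inp) - r + 1, len(inp[0]) - r + 1
    out = [[0] * cols for _ in range(rows)]
    for src, row in enumerate(inp):      # each input row is visited once
        for di in range(r):
            i = src - di                 # output row that uses this input row with weights w[di]
            if not 0 <= i < rows:
                continue
            for j in range(cols):
                partial = 0
                for dj in range(r):
                    partial += w[di][dj] * row[j + dj]
                out[i][j] += partial     # regrouped accumulation: out + (a + b + c)
    return out


# Hypothetical example data: 3x3 weights and a 5x6 integer grid.
w = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
inp = [[(3 * i + 5 * j) % 7 for j in range(6)] for i in range(5)]
assert stencil_gather(inp, w) == stencil_reordered(inp, w)
```

Python lists do not model registers, so this only shows that the two orderings compute the same sums. In compiled code the reordered form would let a loaded input row stay in registers while it updates several outputs.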

Index Terms

A framework for enhancing data reuse via associative reordering
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Information & Contributors

Information

Published In

ACM SIGPLAN Notices Volume 49, Issue 6

PLDI '14

June 2014

598 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/2666356

Editor:
Andy Gill
University of Kansas, Lawrence, KS

PLDI '14: Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2014
619 pages
ISBN:9781450327848
DOI:10.1145/2594291
General Chair:
Michael O'Boyle
University of Edinburgh
Program Chair:
Keshav Pingali
University of Texas, Austin

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 47
Total Downloads: 794
Downloads (Last 12 months): 23
Downloads (Last 6 weeks): 2

Reflects downloads up to 26 Sep 2024

Citations

Cited By

Chen Y, Li K, Wang Y, Bai D, Wang L, Ma L, Yuan L, Zhang Y, Cao T, Yang M, Lee I, Chabbi M, Steuwer M (2024). ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor Cores. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pages 333-347. Online publication date: 2-Mar-2024.
https://dl.acm.org/doi/10.1145/3627535.3638476

Hückelheim J, Hascoët L (2022). Source-to-Source Automatic Differentiation of OpenMP Parallel Loops. ACM Transactions on Mathematical Software 48:1, pages 1-32. Online publication date: 16-Feb-2022.
https://dl.acm.org/doi/10.1145/3472796

Huckelheim J, Doerfert J (2021). Spray: Sparse Reductions of Arrays in OPENMP. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 475-484. Online publication date: May-2021.
https://doi.org/10.1109/IPDPS49936.2021.00056

Hückelheim J, Kukreja N, Narayanan S, Luporini F, Gorman G, Hovland P (2019). Automatic Differentiation for Adjoint Stencil Loops. Proceedings of the 48th International Conference on Parallel Processing, pages 1-10. Online publication date: 5-Aug-2019.
https://dl.acm.org/doi/10.1145/3337821.3337906

Zhao T, Basu P, Williams S, Hall M, Johansen H, Taufer M, Balaji P, Peña A (2019). Exploiting reuse and vectorization in blocked stencil computations on CPUs and GPUs. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-44. Online publication date: 17-Nov-2019.
https://dl.acm.org/doi/10.1145/3295500.3356210

Chi Y, Cong J, Wei P, Zhou P (2018). SODA: Stencil with Optimized Dataflow Architecture. 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 1-8. Online publication date: 5-Nov-2018.
https://dl.acm.org/doi/10.1145/3240765.3240850

Zhao T, Hall M, Basu P, Williams S, Johansen H (2018). SIMD code generation for stencils on brick decompositions. ACM SIGPLAN Notices 53:1, pages 423-424. Online publication date: 10-Feb-2018.
https://dl.acm.org/doi/10.1145/3200691.3178537

Zhao T, Hall M, Basu P, Williams S, Johansen H, Krall A, Gross T (2018). SIMD code generation for stencils on brick decompositions. Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 423-424. Online publication date: 10-Feb-2018.
https://dl.acm.org/doi/10.1145/3178487.3178537

Zhao T, Williams S, Hall M, Johansen H (2018). Delivering Performance-Portable Stencil Computations on CPUs and GPUs Using Bricks. 2018 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), pages 59-70. Online publication date: Nov-2018.
https://doi.org/10.1109/P3HPC.2018.00009

Rawat P, Vaidya M, Sukumaran-Rajam A, Ravishankar M, Grover V, Rountev A, Pouchet L, Sadayappan P (2018). Domain-Specific Optimization and Generation of High-Performance GPU Code for Stencil Computations. Proceedings of the IEEE 106:11, pages 1902-1920. Online publication date: Nov-2018.
https://doi.org/10.1109/JPROC.2018.2862896
