A practical automatic polyhedral parallelizer and locality optimizer

Authors: Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan

PLDI '08: Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation

June 2008

Pages 101 - 113

https://doi.org/10.1145/1375581.1375595

Published: 07 June 2008

Abstract

We present the design and implementation of an automatic polyhedral source-to-source transformation framework that can optimize regular programs (sequences of possibly imperfectly nested loops) for parallelism and locality simultaneously. Through this work, we show the practicality of analytical model-driven automatic transformation in the polyhedral model -- far beyond what is possible by current production compilers. Unlike previous work, our approach is end-to-end and fully automatic, driven by an integer linear optimization framework that takes an explicit view of finding good ways of tiling for parallelism and locality using affine transformations. The framework has been implemented in a tool that automatically generates OpenMP parallel code from C program sections. Experimental results from the tool show very high speedups for local and parallel execution on multi-cores over state-of-the-art compiler frameworks from the research community as well as the best native production compilers. The system also enables the easy use of powerful empirical/iterative optimization for general arbitrarily nested loop sequences.
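
To make "tiling for parallelism and locality" concrete, the following is a minimal hand-written sketch in C of the kind of code shape such a tool targets: a matrix multiplication whose loops are split into tiles, with the outermost tile loop marked parallel through OpenMP. It is not output from the authors' tool. The problem size `N`, the tile size `TILE`, the choice of kernel, and all function names are assumptions made for illustration.

```c
#define N 50     /* illustrative problem size, deliberately not a multiple of TILE */
#define TILE 16  /* illustrative tile size, not tuned for any particular cache */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Fill the inputs with small integer values so that all sums are exact. */
void fill_inputs(double A[N][N], double B[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (double)((i + j) % 7);
            B[i][j] = (double)((3 * i + j) % 5);
        }
}

/* Reference version: the original, untiled loop nest. */
void matmul_naive(double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

/* Tiled version: the ii/jj/kk loops walk over tiles, the i/j/k loops
 * walk inside one tile. Distinct ii tiles write disjoint rows of C,
 * so the ii loop can run in parallel without synchronization. */
void matmul_tiled(double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0;

    #pragma omp parallel for
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int kk = 0; kk < N; kk += TILE)
                /* min_int handles the partial tiles at the boundary. */
                for (int i = ii; i < min_int(ii + TILE, N); i++)
                    for (int j = jj; j < min_int(jj + TILE, N); j++)
                        for (int k = kk; k < min_int(kk + TILE, N); k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```

Compiled without OpenMP support, the pragma is ignored and the tiled loop nest runs sequentially with the same result; with GCC, OpenMP is enabled by the `-fopenmp` flag. Tiling here is legal because each `C[i][j]` still accumulates its products in ascending `k` order, and the tile size would in practice be chosen by a model or by empirical search rather than fixed as above.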

Index Terms

Software and its engineering
  Software notations and tools
    Compilers
      Source code generation

Published In

PLDI '08: Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation

June 2008

396 pages

ISBN:9781595938602

DOI:10.1145/1375581

General Chair: Rajiv Gupta, University of California, Riverside, USA
Program Chair: Saman Amarasinghe, Massachusetts Institute of Technology, USA

ACM SIGPLAN Notices Volume 43, Issue 6
PLDI '08
June 2008
382 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/1379022

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PLDI '08: ACM SIGPLAN Conference on Programming Language Design and Implementation

June 7 - 13, 2008

Tucson, AZ, USA

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%
