DOI: 10.5555/3314872.3314896

Tiramisu: a polyhedral compiler for expressing fast and portable code

Published: 16 February 2019

Abstract

This paper introduces Tiramisu, a polyhedral framework designed to generate high performance code for multiple platforms including multicores, GPUs, and distributed machines. Tiramisu introduces a scheduling language with novel extensions to explicitly manage the complexities that arise when targeting these systems. The framework is designed for the areas of image processing, stencils, linear algebra and deep learning. Tiramisu has two main features: it relies on a flexible representation based on the polyhedral model and it has a rich scheduling language allowing fine-grained control of optimizations. Tiramisu uses a four-level intermediate representation that allows full separation between the algorithms, loop transformations, data layouts, and communication. This separation simplifies targeting multiple hardware architectures with the same algorithm. We evaluate Tiramisu by writing a set of image processing, deep learning, and linear algebra benchmarks and compare them with state-of-the-art compilers and hand-tuned libraries. We show that Tiramisu matches or outperforms existing compilers and libraries on different hardware architectures, including multicore CPUs, GPUs, and distributed machines.
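The separation the abstract describes, writing an algorithm once and choosing loop transformations independently, can be illustrated with a small self-contained sketch. This is not Tiramisu's actual C++ API; it is a toy Python illustration of the principle that two different schedules (a naive loop nest and a tiled one) must compute the same result while differing only in execution order and locality:

```python
# Algorithm: C[i][j] = sum_k A[i][k] * B[k][j]  (plain matrix multiply).
# The two functions below are two *schedules* of that one algorithm.

def matmul_naive(A, B, n):
    # Schedule 1: the textbook i-j-k loop nest.
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B, n, t=2):
    # Schedule 2: the same algorithm with all three loops tiled by t,
    # improving cache locality without changing the computed values.
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, t):
        for jj in range(0, n, t):
            for kk in range(0, n, t):
                for i in range(ii, min(ii + t, n)):
                    for j in range(jj, min(jj + t, n)):
                        for k in range(kk, min(kk + t, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C

n = 4
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i * j + 1 for j in range(n)] for i in range(n)]
assert matmul_naive(A, B, n) == matmul_tiled(A, B, n)
```

In Tiramisu itself the schedule is expressed through scheduling commands applied to an unchanged algorithm specification, so retargeting a CPU, GPU, or distributed machine changes only the schedule, not the algorithm.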




Published In

CGO 2019: Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization
February 2019
286 pages
ISBN:9781728114361

Publisher

IEEE Press

Publication Notes

Badge change: this article was originally badged under Version 1.0 of the ACM Artifact Review Badging guidelines (https://www.acm.org/publications/policies/artifact-review-badging).


Author Tags

  1. Code Generation
  2. Code Optimization
  3. Deep Learning
  4. Distributed Systems
  5. GPU
  6. Polyhedral Model
  7. Tensors

Qualifiers

  • Article

Acceptance Rates

Overall acceptance rate: 312 of 1,061 submissions (29%)

Article Metrics

  • Downloads (last 12 months): 41
  • Downloads (last 6 weeks): 2
Reflects downloads up to 30 Jan 2025

Cited By
  • (2024) SENNA: Unified Hardware/Software Space Exploration for Parametrizable Neural Network Accelerators. ACM Transactions on Embedded Computing Systems 24(2):1-26. DOI: 10.1145/3705731. Published 26 Nov 2024.
  • (2024) SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree Inference. Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 488-504. DOI: 10.1145/3694715.3695958. Published 4 Nov 2024.
  • (2024) Fasor: A Fast Tensor Program Optimization Framework for Efficient DNN Deployment. Proceedings of the 38th ACM International Conference on Supercomputing, 498-510. DOI: 10.1145/3650200.3656631. Published 30 May 2024.
  • (2024) Scheduling for Cyber-Physical Systems with Heterogeneous Processing Units under Real-World Constraints. Proceedings of the 38th ACM International Conference on Supercomputing, 298-311. DOI: 10.1145/3650200.3656625. Published 30 May 2024.
  • (2024) DNNOPT: A Framework for Efficiently Selecting On-chip Memory Loop Optimizations of DNN Accelerators. Proceedings of the 21st ACM International Conference on Computing Frontiers, 126-137. DOI: 10.1145/3649153.3649196. Published 7 May 2024.
  • (2024) Compiler-Based Memory Encryption for Machine Learning on Commodity Low-Power Devices. Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction, 198-211. DOI: 10.1145/3640537.3641564. Published 17 Feb 2024.
  • (2023) Heron: Automatically Constrained High-Performance Library Generation for Deep Learning Accelerators. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 314-328. DOI: 10.1145/3582016.3582061. Published 25 Mar 2023.
  • (2023) Automatic Generation of Distributed-Memory Mappings for Tensor Computations. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-13. DOI: 10.1145/3581784.3607096. Published 12 Nov 2023.
  • (2023) Performance Embeddings: A Similarity-Based Transfer Tuning Approach to Performance Optimization. Proceedings of the 37th International Conference on Supercomputing, 50-62. DOI: 10.1145/3577193.3593714. Published 21 Jun 2023.
  • (2023) WACO: Learning Workload-Aware Co-optimization of the Format and Schedule of a Sparse Tensor Program. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 920-934. DOI: 10.1145/3575693.3575742. Published 27 Jan 2023.
