DOI: 10.5555/3314872.3314896

Tiramisu: a polyhedral compiler for expressing fast and portable code

Published: 16 February 2019

Abstract

This paper introduces Tiramisu, a polyhedral framework designed to generate high performance code for multiple platforms including multicores, GPUs, and distributed machines. Tiramisu introduces a scheduling language with novel extensions to explicitly manage the complexities that arise when targeting these systems. The framework is designed for the areas of image processing, stencils, linear algebra and deep learning. Tiramisu has two main features: it relies on a flexible representation based on the polyhedral model and it has a rich scheduling language allowing fine-grained control of optimizations. Tiramisu uses a four-level intermediate representation that allows full separation between the algorithms, loop transformations, data layouts, and communication. This separation simplifies targeting multiple hardware architectures with the same algorithm. We evaluate Tiramisu by writing a set of image processing, deep learning, and linear algebra benchmarks and compare them with state-of-the-art compilers and hand-tuned libraries. We show that Tiramisu matches or outperforms existing compilers and libraries on different hardware architectures, including multicore CPUs, GPUs, and distributed machines.
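The separation the abstract describes, writing an algorithm once and choosing loop transformations independently, can be illustrated with a small self-contained sketch. This is not Tiramisu's actual C++ API; it is a toy Python illustration of the principle that two different schedules (a naive loop nest and a tiled one) must compute the same result while differing only in execution order and locality:

```python
# Algorithm: C[i][j] = sum_k A[i][k] * B[k][j]  (plain matrix multiply).
# The two functions below are two *schedules* of that one algorithm.

def matmul_naive(A, B, n):
    # Schedule 1: the textbook i-j-k loop nest.
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B, n, t=2):
    # Schedule 2: the same algorithm with all three loops tiled by t,
    # improving cache locality without changing the computed values.
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, t):
        for jj in range(0, n, t):
            for kk in range(0, n, t):
                for i in range(ii, min(ii + t, n)):
                    for j in range(jj, min(jj + t, n)):
                        for k in range(kk, min(kk + t, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C

n = 4
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i * j + 1 for j in range(n)] for i in range(n)]
assert matmul_naive(A, B, n) == matmul_tiled(A, B, n)
```

In Tiramisu itself the schedule is expressed through scheduling commands applied to an unchanged algorithm specification, so retargeting a CPU, GPU, or distributed machine changes only the schedule, not the algorithm.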




Published In

CGO 2019: Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization
February 2019
286 pages
ISBN:9781728114361

Publisher

IEEE Press

Publication Notes

Badge change: this article was originally badged under Version 1.0 of the ACM Artifact Review Badging guidelines (https://www.acm.org/publications/policies/artifact-review-badging).


Author Tags

  1. Code Generation
  2. Code Optimization
  3. Deep Learning
  4. Distributed Systems
  5. GPU
  6. Polyhedral Model
  7. Tensors

Qualifiers

  • Article

Acceptance Rates

Overall acceptance rate: 312 of 1,061 submissions (29%)

Article Metrics

  • Downloads (last 12 months): 41
  • Downloads (last 6 weeks): 2
Reflects downloads up to 30 Jan 2025

Cited By
  • (2024) SENNA: Unified Hardware/Software Space Exploration for Parametrizable Neural Network Accelerators. ACM Transactions on Embedded Computing Systems 24(2):1-26. DOI: 10.1145/3705731. Published 26 Nov 2024.
  • (2024) SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree Inference. Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 488-504. DOI: 10.1145/3694715.3695958. Published 4 Nov 2024.
  • (2024) Fasor: A Fast Tensor Program Optimization Framework for Efficient DNN Deployment. Proceedings of the 38th ACM International Conference on Supercomputing, 498-510. DOI: 10.1145/3650200.3656631. Published 30 May 2024.
  • (2024) Scheduling for Cyber-Physical Systems with Heterogeneous Processing Units under Real-World Constraints. Proceedings of the 38th ACM International Conference on Supercomputing, 298-311. DOI: 10.1145/3650200.3656625. Published 30 May 2024.
  • (2024) DNNOPT: A Framework for Efficiently Selecting On-chip Memory Loop Optimizations of DNN Accelerators. Proceedings of the 21st ACM International Conference on Computing Frontiers, 126-137. DOI: 10.1145/3649153.3649196. Published 7 May 2024.
  • (2024) Compiler-Based Memory Encryption for Machine Learning on Commodity Low-Power Devices. Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction, 198-211. DOI: 10.1145/3640537.3641564. Published 17 Feb 2024.
  • (2023) Heron: Automatically Constrained High-Performance Library Generation for Deep Learning Accelerators. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 314-328. DOI: 10.1145/3582016.3582061. Published 25 Mar 2023.
  • (2023) Automatic Generation of Distributed-Memory Mappings for Tensor Computations. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-13. DOI: 10.1145/3581784.3607096. Published 12 Nov 2023.
  • (2023) Performance Embeddings: A Similarity-Based Transfer Tuning Approach to Performance Optimization. Proceedings of the 37th International Conference on Supercomputing, 50-62. DOI: 10.1145/3577193.3593714. Published 21 Jun 2023.
  • (2023) WACO: Learning Workload-Aware Co-optimization of the Format and Schedule of a Sparse Tensor Program. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 920-934. DOI: 10.1145/3575693.3575742. Published 27 Jan 2023.
