Generating performance portable code using rewrite rules: from high-level functional expressions to high-performance OpenCL code

Authors: Michel Steuwer, Christian Fensch, Christophe Dubach

ACM SIGPLAN Notices, Volume 50, Issue 9

Pages 205 - 217

https://doi.org/10.1145/2858949.2784754

Published: 29 August 2015

Abstract

Computers have become increasingly complex with the emergence of heterogeneous hardware combining multicore CPUs and GPUs. These parallel systems exhibit tremendous computational power at the cost of increased programming effort, resulting in a tension between performance and code portability. Typically, code is either tuned in a low-level imperative language using hardware-specific optimizations to achieve maximum performance, or written in a high-level, possibly functional, language to achieve portability at the expense of performance. We propose a novel approach that aims to combine high-level programming, code portability, and high performance. Starting from a high-level functional expression, we apply a simple set of rewrite rules to transform it into a low-level functional representation, close to the OpenCL programming model, from which OpenCL code is generated. Our rewrite rules define a space of possible implementations, which we automatically explore to generate hardware-specific OpenCL implementations. We formalize our system with a core dependently typed lambda calculus along with a denotational semantics, which we use to prove the correctness of the rewrite rules. We test our design in practice by implementing a compiler that generates high-performance imperative OpenCL code. Our experiments show that we can automatically derive hardware-specific implementations from simple high-level functional algorithmic expressions, offering performance on a par with highly tuned code for multicore CPUs and GPUs written by experts.
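The idea of semantics-preserving rewrite rules spanning a space of implementations can be illustrated with a minimal Python sketch. It is not the paper's compiler or rule set: the expression classes (`Map`, `Compose`, `Split`, `Join`) and the helpers `fuse_maps`, `split_join`, and `variants` are hypothetical names chosen for illustration. The two rules shown are standard list identities (map fusion and split/join decomposition). The sketch only checks that every rewritten variant computes the same result; it does not generate OpenCL or measure performance.

```python
from dataclasses import dataclass
from typing import Any


# Hypothetical miniature expression language (illustrative names only).
@dataclass(frozen=True)
class Map:
    f: Any  # a Python callable on scalars, or a nested expression on chunks


@dataclass(frozen=True)
class Compose:
    outer: Any  # applied second
    inner: Any  # applied first


@dataclass(frozen=True)
class Split:
    n: int  # chunk length


@dataclass(frozen=True)
class Join:
    pass


def evaluate(expr, xs):
    """Reference interpreter: gives each expression its meaning on lists."""
    if isinstance(expr, Map):
        if callable(expr.f):
            return [expr.f(x) for x in xs]
        return [evaluate(expr.f, chunk) for chunk in xs]
    if isinstance(expr, Compose):
        return evaluate(expr.outer, evaluate(expr.inner, xs))
    if isinstance(expr, Split):
        return [xs[i:i + expr.n] for i in range(0, len(xs), expr.n)]
    if isinstance(expr, Join):
        return [x for chunk in xs for x in chunk]
    raise TypeError(f"unknown expression: {expr!r}")


def fuse_maps(expr):
    """Rule: map f . map g  ->  map (f . g), for scalar f and g."""
    if (isinstance(expr, Compose)
            and isinstance(expr.outer, Map) and isinstance(expr.inner, Map)
            and callable(expr.outer.f) and callable(expr.inner.f)):
        f, g = expr.outer.f, expr.inner.f
        return Map(lambda x: f(g(x)))
    return expr


def split_join(expr, n):
    """Rule: map f  ->  join . map (map f) . split n."""
    if isinstance(expr, Map) and n > 0:
        return Compose(Join(), Compose(Map(expr), Split(n)))
    return expr


def variants(expr, chunk_sizes):
    """Enumerate a few equivalent candidates reachable by the two rules."""
    fused = fuse_maps(expr)
    return [expr, fused] + [split_join(fused, n) for n in chunk_sizes]


# Example: add one after doubling, written as two separate maps.
program = Compose(Map(lambda x: x + 1), Map(lambda x: x * 2))
data = list(range(8))
candidates = variants(program, chunk_sizes=[2, 4])
results = [evaluate(c, data) for c in candidates]
```

In this toy setting every candidate evaluates to the same list, which is the property a correctness proof of the rules would guarantee in general. A real system would additionally rank the candidates by measured performance on the target device, which this sketch leaves out.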

Index Terms

Generating performance portable code using rewrite rules: from high-level functional expressions to high-performance OpenCL code
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
    2. General programming languages
      1. Language types

Information

Published In

ACM SIGPLAN Notices Volume 50, Issue 9

ICFP '15

September 2015

436 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/2858949

Editor: Andy Gill, University of Kansas, Lawrence, KS

ICFP 2015: Proceedings of the 20th ACM SIGPLAN International Conference on Functional Programming
August 2015
436 pages
ISBN:9781450336697
DOI:10.1145/2784731
General Chair: Kathleen Fisher, Tufts University, USA
Program Chair: John Reppy, University of Chicago, USA

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Engineering and Physical Sciences Research Council

Bibliometrics

Total Citations: 107
Total Downloads: 868
Downloads (Last 12 months): 35
Downloads (Last 6 weeks): 4

Reflects downloads up to 08 Feb 2025

Cited By

Laird A, Liu B, Bjørner N, Dehnavi M (2024). SpEQ: Translation of Sparse Codes using Equivalences. Proceedings of the ACM on Programming Languages 8:PLDI (1680-1703). DOI: 10.1145/3656445. Online publication date: 20-Jun-2024
https://dl.acm.org/doi/10.1145/3656445
de Wolff I, van Balen D, Keller G, McDonell T (2024). Zero-Overhead Parallel Scans for Multi-Core CPUs. Proceedings of the 15th International Workshop on Programming Models and Applications for Multicores and Manycores (52-61). DOI: 10.1145/3649169.3649248. Online publication date: 3-Mar-2024
https://dl.acm.org/doi/10.1145/3649169.3649248
Bisbas G, Lydike A, Bauer E, Brown N, Fehr M, Mitchell L, Rodriguez-Canal G, Jamieson M, Kelly P, Steuwer M, Grosser T, Tsafrir D, Musuvathi M, Gupta R, Abu-Ghazaleh N (2024). A shared compilation stack for distributed-memory parallelism in stencil DSLs. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (38-56). DOI: 10.1145/3620666.3651344. Online publication date: 27-Apr-2024
https://dl.acm.org/doi/10.1145/3620666.3651344
Hellsten E, Souza A, Lenfers J, Lacouture R, Hsu O, Ejjeh A, Kjolstad F, Steuwer M, Olukotun K, Nardi L, Aamodt T, Swift M, Jerger N (2023). BaCO: A Fast and Portable Bayesian Compiler Optimization Framework. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4 (19-42). DOI: 10.1145/3623278.3624770. Online publication date: 25-Mar-2023
https://dl.acm.org/doi/10.1145/3623278.3624770
Sozzo E, Conficconi D, Zeni A, Salaris M, Sciuto D, Santambrogio M (2022). Pushing the Level of Abstraction of Digital System Design: A Survey on How to Program FPGAs. ACM Computing Surveys 55:5 (1-48). DOI: 10.1145/3532989. Online publication date: 3-Dec-2022
https://dl.acm.org/doi/10.1145/3532989
Miller J, Trümper L, Terboven C, Müller M (2021). A Theoretical Model for Global Optimization of Parallel Algorithms. Mathematics 9:14 (1685). DOI: 10.3390/math9141685. Online publication date: 17-Jul-2021
https://doi.org/10.3390/math9141685
Stoltzfus L, Hamilton B, Steuwer M, Li L, Dubach C (2021). Code Generation for Room Acoustics Simulations with Complex Boundary Conditions. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (485-496). DOI: 10.1109/IPDPS49936.2021.00057. Online publication date: May-2021
https://doi.org/10.1109/IPDPS49936.2021.00057
Kœhler T, Steuwer M, Lee J (2021). Towards a domain-extensible compiler. Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization (27-38). DOI: 10.1109/CGO51591.2021.9370337. Online publication date: 27-Feb-2021
https://dl.acm.org/doi/10.1109/CGO51591.2021.9370337
Arif M, Zhou R, Ho H, Jones T, Lee J (2021). Cinnamon. Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization (103-114). DOI: 10.1109/CGO51591.2021.9370313. Online publication date: 27-Feb-2021
https://dl.acm.org/doi/10.1109/CGO51591.2021.9370313
Özkan M, Ok B, Qiao B, Teich J, Hannig F (2021). HipaccVX: wedding of OpenVX and DSL-based code generation. Journal of Real-Time Image Processing 18:3 (765-777). DOI: 10.1007/s11554-020-01015-5. Online publication date: 1-Jun-2021
https://dl.acm.org/doi/10.1007/s11554-020-01015-5
