Generating performance portable code using rewrite rules: from high-level functional expressions to high-performance OpenCL code

Published: 29 August 2015

Abstract

Computers have become increasingly complex with the emergence of heterogeneous hardware combining multicore CPUs and GPUs. These parallel systems exhibit tremendous computational power at the cost of increased programming effort, resulting in a tension between performance and code portability. Typically, code is either tuned in a low-level imperative language using hardware-specific optimizations to achieve maximum performance, or written in a high-level, possibly functional, language to achieve portability at the expense of performance. We propose a novel approach that aims to combine high-level programming, code portability, and high performance. Starting from a high-level functional expression, we apply a simple set of rewrite rules to transform it into a low-level functional representation, close to the OpenCL programming model, from which OpenCL code is generated. Our rewrite rules define a space of possible implementations, which we automatically explore to generate hardware-specific OpenCL implementations. We formalize our system with a core dependently typed lambda calculus along with a denotational semantics, which we use to prove the correctness of the rewrite rules. We test our design in practice by implementing a compiler that generates high-performance imperative OpenCL code. Our experiments show that we can automatically derive hardware-specific implementations from simple high-level functional algorithmic expressions, offering performance on a par with highly tuned code for multicore CPUs and GPUs written by experts.
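To make the idea of semantics-preserving rewrite rules concrete, the following is a minimal Python sketch, not the system described in the abstract. It assumes a single hypothetical rule that rewrites a high-level `Map` into a chunked split/map/join form, and a toy `explore` function that enumerates candidate implementations. All names here (`Map`, `SplitMapJoin`, `rewrite_split_join`, `evaluate`, `explore`) are illustrative assumptions; the sketch interprets expressions directly and generates no OpenCL code.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass(frozen=True)
class Map:
    """High-level pattern: apply `f` to every element of the input."""
    f: Callable[[int], int]


@dataclass(frozen=True)
class SplitMapJoin:
    """Lower-level form: split into chunks, map over each chunk, join results."""
    f: Callable[[int], int]
    chunk: int


Expr = Union[Map, SplitMapJoin]


def rewrite_split_join(expr: Expr, chunk: int) -> Expr:
    """Hypothetical rewrite rule: Map(f) -> SplitMapJoin(f, chunk)."""
    if isinstance(expr, Map):
        return SplitMapJoin(expr.f, chunk)
    return expr  # the rule does not apply; leave the expression unchanged


def evaluate(expr: Expr, xs: List[int]) -> List[int]:
    """Reference interpreter used to check that a rewrite preserves meaning."""
    if isinstance(expr, Map):
        return [expr.f(x) for x in xs]
    if len(xs) % expr.chunk != 0:
        raise ValueError("input length must be divisible by the chunk size")
    # split: cut the input into consecutive chunks of equal size
    chunks = [xs[i:i + expr.chunk] for i in range(0, len(xs), expr.chunk)]
    # map over chunks, then map over the elements inside each chunk
    mapped = [[expr.f(x) for x in c] for c in chunks]
    # join: flatten the chunks back into one list
    return [y for c in mapped for y in c]


def explore(expr: Expr, chunk_sizes: List[int]) -> List[Expr]:
    """Enumerate the original expression plus one rewritten candidate per chunk size."""
    return [expr] + [rewrite_split_join(expr, n) for n in chunk_sizes]
```

Calling `explore(Map(f), [2, 4])` yields three candidate expressions that `evaluate` maps to the same output for any input whose length is divisible by the chunk sizes. A real system would choose among such candidates by measuring the generated code on the target hardware; that step is outside the scope of this sketch.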

      Published In

      ACM SIGPLAN Notices, Volume 50, Issue 9
      ICFP '15
      September 2015
      436 pages
      ISSN: 0362-1340
      EISSN: 1558-1160
      DOI: 10.1145/2858949
      Editor: Andy Gill

      ICFP 2015: Proceedings of the 20th ACM SIGPLAN International Conference on Functional Programming
      August 2015
      436 pages
      ISBN: 9781450336697
      DOI: 10.1145/2784731
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 29 August 2015
      Published in SIGPLAN Volume 50, Issue 9

      Author Tags

      1. Algorithmic patterns
      2. GPU
      3. OpenCL
      4. code generation
      5. performance portability
      6. rewrite rules

      Qualifiers

      • Research-article

      Citations

      Cited By

      • (2024) SpEQ: Translation of Sparse Codes using Equivalences. Proceedings of the ACM on Programming Languages 8:PLDI (1680-1703). DOI: 10.1145/3656445. Online publication date: 20-Jun-2024
      • (2024) Zero-Overhead Parallel Scans for Multi-Core CPUs. Proceedings of the 15th International Workshop on Programming Models and Applications for Multicores and Manycores (52-61). DOI: 10.1145/3649169.3649248. Online publication date: 3-Mar-2024
      • (2024) A shared compilation stack for distributed-memory parallelism in stencil DSLs. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (38-56). DOI: 10.1145/3620666.3651344. Online publication date: 27-Apr-2024
      • (2023) BaCO: A Fast and Portable Bayesian Compiler Optimization Framework. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4 (19-42). DOI: 10.1145/3623278.3624770. Online publication date: 25-Mar-2023
      • (2022) Pushing the Level of Abstraction of Digital System Design: A Survey on How to Program FPGAs. ACM Computing Surveys 55:5 (1-48). DOI: 10.1145/3532989. Online publication date: 3-Dec-2022
      • (2021) A Theoretical Model for Global Optimization of Parallel Algorithms. Mathematics 9:14 (1685). DOI: 10.3390/math9141685. Online publication date: 17-Jul-2021
      • (2021) Code Generation for Room Acoustics Simulations with Complex Boundary Conditions. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (485-496). DOI: 10.1109/IPDPS49936.2021.00057. Online publication date: May-2021
      • (2021) Towards a domain-extensible compiler. Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization (27-38). DOI: 10.1109/CGO51591.2021.9370337. Online publication date: 27-Feb-2021
      • (2021) Cinnamon. Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization (103-114). DOI: 10.1109/CGO51591.2021.9370313. Online publication date: 27-Feb-2021
      • (2021) HipaccVX: wedding of OpenVX and DSL-based code generation. Journal of Real-Time Image Processing 18:3 (765-777). DOI: 10.1007/s11554-020-01015-5. Online publication date: 1-Jun-2021
