research-article
Open access

Automatic and Portable Mapping of Data Parallel Programs to OpenCL for GPU-Based Heterogeneous Systems

Published: 08 December 2014

Abstract

General-purpose GPU-based systems are highly attractive, as they give potentially massive performance at little cost. Realizing such potential is challenging due to the complexity of programming. This article presents a compiler-based approach to automatically generate optimized OpenCL code from data parallel OpenMP programs for GPUs. A key feature of our scheme is that it leverages existing transformations, especially data transformations, to improve performance on GPU architectures and uses automatic machine learning to build a predictive model to determine if it is worthwhile running the OpenCL code on the GPU or OpenMP code on the multicore host. We applied our approach to the entire NAS parallel benchmark suite and evaluated it on distinct GPU-based systems. We achieved average (up to) speedups of 4.51× and 4.20× (143× and 67×) on Core i7/NVIDIA GeForce GTX580 and Core i7/AMD Radeon 7970 platforms, respectively, over a sequential baseline. Our approach achieves, on average, greater than 10× speedups over two state-of-the-art automatic GPU code generators.
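The mapping decision described in the abstract (run the generated OpenCL code on the GPU, or keep the OpenMP code on the multicore host) can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the article's model: the feature names, the thresholds, and the hand-written rule are hypothetical stand-ins for a predictor that the article builds automatically with machine learning.

```python
from dataclasses import dataclass


@dataclass
class KernelFeatures:
    """Hypothetical per-kernel features; none of these names come from the article."""
    compute_ops: int           # estimated arithmetic operations in the kernel
    bytes_transferred: int     # estimated host <-> device traffic in bytes
    coalesced_fraction: float  # share of memory accesses assumed to be coalesced (0..1)


def choose_device(f: KernelFeatures,
                  min_ops_per_byte: float = 4.0,
                  min_coalesced: float = 0.5) -> str:
    """Pick a target for one kernel.

    A hand-written threshold rule standing in for a learned predictor:
    prefer the GPU only when there is enough computation per transferred
    byte and the memory accesses are mostly coalesced.
    """
    # Guard against division by zero for kernels with no transfers.
    ops_per_byte = f.compute_ops / max(f.bytes_transferred, 1)
    if ops_per_byte >= min_ops_per_byte and f.coalesced_fraction >= min_coalesced:
        return "gpu-opencl"
    return "cpu-openmp"


if __name__ == "__main__":
    heavy = KernelFeatures(compute_ops=1_000_000, bytes_transferred=1_000,
                           coalesced_fraction=0.9)
    light = KernelFeatures(compute_ops=1_000, bytes_transferred=1_000_000,
                           coalesced_fraction=0.9)
    print(choose_device(heavy))
    print(choose_device(light))
```

In a learned setting, the thresholds would be replaced by a model trained on measured runtimes of both code versions; the sketch only shows where such a predictor would sit in the flow.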

    Information

    Published In

    ACM Transactions on Architecture and Code Optimization  Volume 11, Issue 4
    January 2015
    797 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/2695583
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 08 December 2014
    Accepted: 01 October 2014
    Revised: 01 October 2014
    Received: 01 December 2013
    Published in TACO Volume 11, Issue 4

    Author Tags

    1. GPU
    2. Machine-learning mapping
    3. OpenCL

    Qualifiers

    • Research-article
    • Research
    • Refereed
