Open access

Automatic and Portable Mapping of Data Parallel Programs to OpenCL for GPU-Based Heterogeneous Systems

Authors:

Michael F. P. O’Boyle

ACM Transactions on Architecture and Code Optimization (TACO), Volume 11, Issue 4

Article No.: 42, Pages 1 - 26

https://doi.org/10.1145/2677036

Published: 08 December 2014

Abstract

General-purpose GPU-based systems are highly attractive, as they give potentially massive performance at little cost. Realizing such potential is challenging due to the complexity of programming. This article presents a compiler-based approach to automatically generate optimized OpenCL code from data parallel OpenMP programs for GPUs. A key feature of our scheme is that it leverages existing transformations, especially data transformations, to improve performance on GPU architectures and uses automatic machine learning to build a predictive model to determine if it is worthwhile running the OpenCL code on the GPU or OpenMP code on the multicore host. We applied our approach to the entire NAS parallel benchmark suite and evaluated it on distinct GPU-based systems. We achieved average (up to) speedups of 4.51× and 4.20× (143× and 67×) on Core i7/NVIDIA GeForce GTX580 and Core i7/AMD Radeon 7970 platforms, respectively, over a sequential baseline. Our approach achieves, on average, greater than 10× speedups over two state-of-the-art automatic GPU code generators.
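The mapping decision described in the abstract, running the OpenCL code on the GPU or the OpenMP code on the multicore host, can be pictured as a small classifier over per-kernel features. The sketch below is only an illustration under stated assumptions, not the model from the article: the feature names (`compute_ops`, `memory_ops`, `transfer_bytes`, `coalesced_fraction`), the `choose_device` function, and both thresholds are hypothetical, and a learned model would fit its own splits from training runs on each target platform.

```python
from dataclasses import dataclass


@dataclass
class KernelFeatures:
    """Hypothetical per-kernel features; the names are illustrative only."""
    compute_ops: int           # arithmetic operations executed by the kernel
    memory_ops: int            # global memory accesses executed by the kernel
    transfer_bytes: int        # bytes copied between host and device
    coalesced_fraction: float  # share of memory accesses that are coalesced (0..1)


def choose_device(f: KernelFeatures) -> str:
    """Return "gpu" or "cpu" using a small hand-written decision tree.

    The thresholds are placeholders (assumptions). A trained model would
    learn them from measured runs on the target platform.
    """
    # Work performed per byte moved over the host-device link.
    work_per_byte = (f.compute_ops + f.memory_ops) / max(f.transfer_bytes, 1)
    if work_per_byte < 1.0:
        return "cpu"  # transfer cost is assumed to dominate
    if f.coalesced_fraction < 0.5:
        return "cpu"  # uncoalesced accesses are assumed to hurt GPU throughput
    return "gpu"
```

A caller would compute the features once per kernel and dispatch to the OpenCL or OpenMP version based on the returned label, for example `choose_device(KernelFeatures(10_000_000, 1_000_000, 1_000_000, 0.9))`.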

Index Terms

Software and its engineering → Software notations and tools → Compilers

Published In

ACM Transactions on Architecture and Code Optimization Volume 11, Issue 4

January 2015

797 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/2695583

Editor:
Koen De Bosschere
Ghent University

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 December 2014

Accepted: 01 October 2014

Revised: 01 October 2014

Received: 01 December 2013

Published in TACO Volume 11, Issue 4

Qualifiers

Research-article
Research
Refereed

Cited By

Ahmed U, Lin J, Srivastava G, Mekala M, and Jung H. 2022. Fuzzy Active Learning to Detect OpenCL Kernel Heterogeneous Machines in Cyber Physical Systems. IEEE Transactions on Fuzzy Systems 30, 11, 4618--4629. Online publication date: Nov-2022. https://doi.org/10.1109/TFUZZ.2022.3167158

Moreń K and Göhringer D. 2021. CoopCL. In Proceedings of the 11th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies. 1--2. Online publication date: 21-Jun-2021. https://dl.acm.org/doi/10.1145/3468044.3468061

Ye G, Tang Z, Tan S, Huang S, Fang D, Sun X, Bian L, Wang H, Wang Z, Freund S, and Yahav E. 2021. Automated conformance testing for JavaScript engines via deep compiler fuzzing. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation. 435--450. Online publication date: 19-Jun-2021. https://dl.acm.org/doi/10.1145/3453483.3454054

Feng D, Liang M, Gao F, Huang Y, Zhang X, and Duan L. 2021. Towards Large-Scale Object Instance Search: A Multi-Block N-Ary Trie. IEEE Transactions on Circuits and Systems for Video Technology 31, 1, 372--386. Online publication date: 6-Jan-2021. https://dl.acm.org/doi/10.1109/TCSVT.2020.2966541

Kassa H, Verma T, Austin T, and Bertacco V. 2021. ChipAdvisor: A Machine Learning Approach for Mapping Applications to Heterogeneous Systems. In 2021 22nd International Symposium on Quality Electronic Design (ISQED). 292--299. Online publication date: 7-Apr-2021. https://doi.org/10.1109/ISQED51717.2021.9424271

Marco V, Taylor B, Wang Z, and Elkhatib Y. 2020. Optimizing Deep Learning Inference on Embedded Systems Through Adaptive Model Selection. ACM Transactions on Embedded Computing Systems 19, 1, 1--28. Online publication date: 6-Feb-2020. https://dl.acm.org/doi/10.1145/3371154

Zhang P, Fang J, Yang C, Huang C, Tang T, and Wang Z. 2020. Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures. IEEE Transactions on Parallel and Distributed Systems 31, 8, 1878--1896. Online publication date: 1-Aug-2020. https://doi.org/10.1109/TPDS.2020.2978045

Leather H and Cummins C. 2020. Machine Learning in Compilers: Past, Present and Future. In 2020 Forum for Specification and Design Languages (FDL). 1--8. Online publication date: 15-Sep-2020. https://doi.org/10.1109/FDL50818.2020.9232934

Fang J, Huang C, Tang T, and Wang Z. 2020. Parallel programming models for heterogeneous many-cores: a comprehensive survey. CCF Transactions on High Performance Computing 2, 4, 382--400. Online publication date: 31-Jul-2020. https://doi.org/10.1007/s42514-020-00039-4

Zheng W, Fang J, Juan C, Wu F, Pan X, Wang H, Sun X, Yuan Y, Xie M, Huang C, Tang T, and Wang Z. 2019. Auto-Tuning MPI Collective Operations on Large-Scale Parallel Systems. In 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS). 670--677. Online publication date: Aug-2019. https://doi.org/10.1109/HPCC/SmartCity/DSS.2019.00101
