Transparent CPU-GPU collaboration for data-parallel kernels on heterogeneous systems

Authors:

Mehrzad Samadi,

Scott Mahlke

PACT '13: Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

Pages 245 - 256

Published: 07 October 2013

Abstract

Heterogeneous computing on CPUs and GPUs has traditionally used fixed roles for each device: the GPU handles data-parallel work by taking advantage of its massive number of cores, while the CPU handles non-data-parallel work, such as the sequential code or data transfer management. Unfortunately, this work distribution can be a poor solution as it underutilizes the CPU, has difficulty generalizing beyond the single CPU-GPU combination, and may waste a large fraction of time transferring data. Further, CPUs are performance-competitive with GPUs on many workloads, so simply partitioning work based on the fixed roles may be a poor choice. In this paper, we present the single kernel multiple devices (SKMD) system, a framework that transparently orchestrates collaborative execution of a single data-parallel kernel across multiple asymmetric CPUs and GPUs. The programmer is responsible for developing a single data-parallel kernel in OpenCL, while the system automatically partitions the workload across an arbitrary set of devices, generates kernels to execute the partial workloads, and efficiently merges the partial outputs together. The goal is performance improvement by maximally utilizing all available resources to execute the kernel. SKMD handles the difficult challenges of exposed data transfer costs and the performance variations GPUs have with respect to input size. On real hardware, SKMD achieves an average speedup of 29% on a system with one multicore CPU and two asymmetric GPUs compared to a fastest-device execution strategy for a set of popular OpenCL kernels.
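The partition, execute, and merge flow described in the abstract can be illustrated with a small sketch. This is not the SKMD implementation: the proportional split by a fixed throughput estimate is an assumption made for illustration (the abstract says SKMD also accounts for data transfer costs and input-size-dependent GPU performance, which this sketch ignores), the devices are simulated in plain Python rather than OpenCL, and all function names and throughput values are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def partition_work_groups(
    num_groups: int, throughput: Dict[str, float]
) -> Dict[str, Tuple[int, int]]:
    """Split [0, num_groups) into contiguous ranges, one per device.

    Assumption: each device's share is proportional to a fixed,
    pre-measured throughput estimate (hypothetical values).
    """
    total = sum(throughput.values())
    devices = list(throughput)
    ranges: Dict[str, Tuple[int, int]] = {}
    start = 0
    for i, dev in enumerate(devices):
        if i == len(devices) - 1:
            end = num_groups  # last device takes whatever remains
        else:
            share = round(num_groups * throughput[dev] / total)
            end = min(start + share, num_groups)
        ranges[dev] = (start, end)
        start = end
    return ranges


def run_partial(
    kernel: Callable[[int], int],
    inputs: List[int],
    group_range: Tuple[int, int],
    group_size: int,
) -> Tuple[int, List[int]]:
    """Simulate one device running the kernel on its work-group range."""
    first, last = group_range
    lo = first * group_size
    hi = min(last * group_size, len(inputs))
    return lo, [kernel(x) for x in inputs[lo:hi]]


def merge(n: int, partials: List[Tuple[int, List[int]]]) -> List[int]:
    """Copy each partial output into its slot of the full output buffer."""
    out: List[int] = [0] * n
    for offset, values in partials:
        out[offset:offset + len(values)] = values
    return out


# Hypothetical system: one CPU and two asymmetric GPUs.
throughput = {"cpu": 1.0, "gpu0": 4.0, "gpu1": 3.0}
data = list(range(64))
group_size = 4
ranges = partition_work_groups(len(data) // group_size, throughput)
partials = [
    run_partial(lambda x: x * x, data, r, group_size) for r in ranges.values()
]
result = merge(len(data), partials)
```

Here each device writes a disjoint, contiguous slice of the output, so merging is a plain copy. Kernels whose work-items write to overlapping or data-dependent locations would need a more careful merge step than this sketch provides.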

Index Terms

Transparent CPU-GPU collaboration for data-parallel kernels on heterogeneous systems
1. Computer systems organization
  1. Architectures
    1. Parallel architectures

Published In

PACT '13: Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

October 2013

422 pages

ISBN:9781479910212

Conference Chair: Christian Fensch, University of Edinburgh, UK

General Chair: Michael O'Boyle, University of Edinburgh, UK

Program Chairs: André Seznec, INRIA Rennes, France; François Bodin, IRISA/CAPS Entreprise, France

Sponsors

IFIP WG 10.3
IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE CS TCPP: IEEE Computer Society Technical Committee on Parallel Processing

Publisher

IEEE Press

Qualifiers

Research-article

Acceptance Rates

PACT '13 Paper Acceptance Rate: 36 of 208 submissions, 17%

Overall Acceptance Rate: 121 of 471 submissions, 26%
