
SKMD: Single Kernel on Multiple Devices for Transparent CPU-GPU Collaboration

Published: 31 August 2015
    Abstract

    Heterogeneous computing on CPUs and GPUs has traditionally used fixed roles for each device: the GPU handles data-parallel work by taking advantage of its massive number of cores, while the CPU handles non-data-parallel work, such as sequential code or data transfer management. This work distribution can be a poor solution, as it underutilizes the CPU, has difficulty generalizing beyond the single CPU-GPU combination, and may waste a large fraction of time transferring data. Further, CPUs are performance competitive with GPUs on many workloads, so simply partitioning work based on these fixed roles may be a poor choice. In this article, we present the single-kernel multiple devices (SKMD) system, a framework that transparently orchestrates collaborative execution of a single data-parallel kernel across multiple asymmetric CPUs and GPUs. The programmer is responsible for developing a single data-parallel kernel in OpenCL, while the system automatically partitions the workload across an arbitrary set of devices, generates kernels to execute the partial workloads, and efficiently merges the partial outputs together. The goal is performance improvement by maximally utilizing all available resources to execute the kernel. SKMD handles the difficult challenges of exposed data transfer costs and the performance variations GPUs have with respect to input size. On real hardware, SKMD achieves an average speedup of 28% on a system with one multicore CPU and two asymmetric GPUs, compared to a fastest-device execution strategy, for a set of popular OpenCL kernels.
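
    To make the partition-and-merge idea concrete, the sketch below splits a single one-dimensional OpenCL NDRange between a GPU command queue and a CPU command queue using a global work offset, then copies each device's disjoint output slice back into one host buffer. This is only an illustration of the general collaboration pattern the abstract describes, not SKMD's implementation: the fixed 70/30 split, the queue, kernel, and buffer names, and the assumption of a contiguous, disjoint output range are all invented for the example, whereas SKMD chooses the partition automatically and generates the code that merges the partial outputs.

    /*
     * Minimal sketch (assumptions noted above): co-execute one data-parallel
     * kernel on a GPU and a CPU by splitting the global index space.
     * Error checking and setup (contexts, program build, buffer writes,
     * clSetKernelArg) are omitted for brevity.
     */
    #include <CL/cl.h>

    void launch_split(cl_command_queue gpu_q, cl_command_queue cpu_q,
                      cl_kernel gpu_kernel, cl_kernel cpu_kernel,
                      cl_mem gpu_out, cl_mem cpu_out,
                      float *host_out, size_t n_items, size_t local_size)
    {
        /* Hypothetical static 70/30 split, rounded down to a multiple of the
           work-group size. (SKMD instead derives the split at run time.) */
        size_t gpu_items  = ((size_t)(n_items * 0.7) / local_size) * local_size;
        size_t cpu_items  = n_items - gpu_items;
        size_t gpu_offset = 0, cpu_offset = gpu_items;

        /* The same kernel body runs on both devices, each covering a
           different slice of the global index space via the work offset. */
        clEnqueueNDRangeKernel(gpu_q, gpu_kernel, 1, &gpu_offset, &gpu_items,
                               &local_size, 0, NULL, NULL);
        clEnqueueNDRangeKernel(cpu_q, cpu_kernel, 1, &cpu_offset, &cpu_items,
                               NULL /* let the runtime pick the group size */,
                               0, NULL, NULL);

        /* Trivial "merge" for disjoint outputs: read each device's slice
           back into the corresponding region of a single host buffer. */
        clEnqueueReadBuffer(gpu_q, gpu_out, CL_FALSE, 0,
                            gpu_items * sizeof(float), host_out,
                            0, NULL, NULL);
        clEnqueueReadBuffer(cpu_q, cpu_out, CL_FALSE,
                            cpu_offset * sizeof(float),
                            cpu_items * sizeof(float), host_out + cpu_offset,
                            0, NULL, NULL);

        clFinish(gpu_q);
        clFinish(cpu_q);
    }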




    Published In

    ACM Transactions on Computer Systems, Volume 33, Issue 3
    September 2015
    140 pages
    ISSN: 0734-2071
    EISSN: 1557-7333
    DOI: 10.1145/2818727
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 31 August 2015
    Accepted: 01 June 2015
    Revised: 01 February 2015
    Received: 01 July 2014
    Published in TOCS Volume 33, Issue 3


    Author Tags

    1. CPU
    2. Compiler
    3. GPU
    4. collaboration
    5. optimization
    6. runtime

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • National Science Foundation
    • Defense Advanced Research Projects Agency under the Power Efficiency Revolution for Embedded Computing Technologies (PERFECT) program


    Article Metrics

    • Downloads (last 12 months): 27
    • Downloads (last 6 weeks): 1
    Reflects downloads up to 27 Jul 2024


    Cited By

    • (2024) CuPBoP: Making CUDA a Portable Language. ACM Transactions on Design Automation of Electronic Systems 29, 4, 1-25. DOI: 10.1145/3659949. Online publication date: 23-Apr-2024.
    • (2024) Hardware support for balanced co-execution in heterogeneous processors. Proceedings of the 21st ACM International Conference on Computing Frontiers, 106-114. DOI: 10.1145/3649153.3649208. Online publication date: 7-May-2024.
    • (2024) A Unified CPU-GPU Protocol for GNN Training. Proceedings of the 21st ACM International Conference on Computing Frontiers, 155-163. DOI: 10.1145/3649153.3649191. Online publication date: 7-May-2024.
    • (2023) Regular Composite Resource Partitioning and Reconfiguration in Open Systems. ACM Transactions on Embedded Computing Systems 22, 5, 1-29. DOI: 10.1145/3609424. Online publication date: 26-Sep-2023.
    • (2023) MVSym: Efficient symbiotic exploitation of HLS-kernel multi-versioning for collaborative CPU-FPGA cloud systems. Integration 93, 102052. DOI: 10.1016/j.vlsi.2023.102052. Online publication date: Dec-2023.
    • (2023) A machine learning-based resource-efficient task scheduler for heterogeneous computer systems. The Journal of Supercomputing 79, 14, 15700-15728. DOI: 10.1007/s11227-023-05266-4. Online publication date: 20-Apr-2023.
    • (2022) Hyperion. Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, 607-621. DOI: 10.1145/3560905.3568511. Online publication date: 6-Nov-2022.
    • (2022) Dopia. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 32-45. DOI: 10.1145/3503221.3508421. Online publication date: 2-Apr-2022.
    • (2022) GraphCL: A Framework for Execution of Data-Flow Graphs on Multi-Device Platforms. 2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), 116-121. DOI: 10.1109/PDP55904.2022.00026. Online publication date: Mar-2022.
    • (2022) Piper: Pipelining OpenMP Offloading Execution Through Compiler Optimization For Performance. 2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), 100-110. DOI: 10.1109/P3HPC56579.2022.00015. Online publication date: Dec-2022.
