
SKMD: Single Kernel on Multiple Devices for Transparent CPU-GPU Collaboration

Published: 31 August 2015
    Abstract

    Heterogeneous computing on CPUs and GPUs has traditionally used fixed roles for each device: the GPU handles data-parallel work by taking advantage of its massive number of cores, while the CPU handles non-data-parallel work, such as sequential code or data transfer management. This work distribution can be a poor solution, as it underutilizes the CPU, has difficulty generalizing beyond the single CPU-GPU combination, and may waste a large fraction of time transferring data. Further, CPUs are performance competitive with GPUs on many workloads, so simply partitioning work based on these fixed roles may be a poor choice. In this article, we present the single-kernel multiple devices (SKMD) system, a framework that transparently orchestrates collaborative execution of a single data-parallel kernel across multiple asymmetric CPUs and GPUs. The programmer is responsible for developing a single data-parallel kernel in OpenCL, while the system automatically partitions the workload across an arbitrary set of devices, generates kernels to execute the partial workloads, and efficiently merges the partial outputs together. The goal is performance improvement by maximally utilizing all available resources to execute the kernel. SKMD handles the difficult challenges of exposed data transfer costs and the performance variations GPUs have with respect to input size. On real hardware, SKMD achieves an average speedup of 28% on a system with one multicore CPU and two asymmetric GPUs, compared to a fastest-device execution strategy, for a set of popular OpenCL kernels.
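
    To make the partition-and-merge idea concrete, the sketch below splits a single one-dimensional OpenCL NDRange between a GPU command queue and a CPU command queue using a global work offset, then copies each device's disjoint output slice back into one host buffer. This is only an illustration of the general collaboration pattern the abstract describes, not SKMD's implementation: the fixed 70/30 split, the queue, kernel, and buffer names, and the assumption of a contiguous, disjoint output range are all invented for the example, whereas SKMD chooses the partition automatically and generates the code that merges the partial outputs.

    /*
     * Minimal sketch (assumptions noted above): co-execute one data-parallel
     * kernel on a GPU and a CPU by splitting the global index space.
     * Error checking and setup (contexts, program build, buffer writes,
     * clSetKernelArg) are omitted for brevity.
     */
    #include <CL/cl.h>

    void launch_split(cl_command_queue gpu_q, cl_command_queue cpu_q,
                      cl_kernel gpu_kernel, cl_kernel cpu_kernel,
                      cl_mem gpu_out, cl_mem cpu_out,
                      float *host_out, size_t n_items, size_t local_size)
    {
        /* Hypothetical static 70/30 split, rounded down to a multiple of the
           work-group size. (SKMD instead derives the split at run time.) */
        size_t gpu_items  = ((size_t)(n_items * 0.7) / local_size) * local_size;
        size_t cpu_items  = n_items - gpu_items;
        size_t gpu_offset = 0, cpu_offset = gpu_items;

        /* The same kernel body runs on both devices, each covering a
           different slice of the global index space via the work offset. */
        clEnqueueNDRangeKernel(gpu_q, gpu_kernel, 1, &gpu_offset, &gpu_items,
                               &local_size, 0, NULL, NULL);
        clEnqueueNDRangeKernel(cpu_q, cpu_kernel, 1, &cpu_offset, &cpu_items,
                               NULL /* let the runtime pick the group size */,
                               0, NULL, NULL);

        /* Trivial "merge" for disjoint outputs: read each device's slice
           back into the corresponding region of a single host buffer. */
        clEnqueueReadBuffer(gpu_q, gpu_out, CL_FALSE, 0,
                            gpu_items * sizeof(float), host_out,
                            0, NULL, NULL);
        clEnqueueReadBuffer(cpu_q, cpu_out, CL_FALSE,
                            cpu_offset * sizeof(float),
                            cpu_items * sizeof(float), host_out + cpu_offset,
                            0, NULL, NULL);

        clFinish(gpu_q);
        clFinish(cpu_q);
    }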




    Published In

    ACM Transactions on Computer Systems, Volume 33, Issue 3
    September 2015
    140 pages
    ISSN: 0734-2071
    EISSN: 1557-7333
    DOI: 10.1145/2818727
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 31 August 2015
    Accepted: 01 June 2015
    Revised: 01 February 2015
    Received: 01 July 2014
    Published in TOCS Volume 33, Issue 3


    Author Tags

    1. CPU
    2. Compiler
    3. GPU
    4. collaboration
    5. optimization
    6. runtime

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • National Science Foundation
    • Defense Advanced Research Projects Agency under the Power Efficiency Revolution for Embedded Computing Technologies (PERFECT) program


    Article Metrics

    • Downloads (last 12 months): 27
    • Downloads (last 6 weeks): 1
    Reflects downloads up to 27 Jul 2024


    Cited By

    • (2024) CuPBoP: Making CUDA a Portable Language. ACM Transactions on Design Automation of Electronic Systems 29, 4, 1-25. DOI: 10.1145/3659949. Online publication date: 23-Apr-2024.
    • (2024) Hardware support for balanced co-execution in heterogeneous processors. Proceedings of the 21st ACM International Conference on Computing Frontiers, 106-114. DOI: 10.1145/3649153.3649208. Online publication date: 7-May-2024.
    • (2024) A Unified CPU-GPU Protocol for GNN Training. Proceedings of the 21st ACM International Conference on Computing Frontiers, 155-163. DOI: 10.1145/3649153.3649191. Online publication date: 7-May-2024.
    • (2023) Regular Composite Resource Partitioning and Reconfiguration in Open Systems. ACM Transactions on Embedded Computing Systems 22, 5, 1-29. DOI: 10.1145/3609424. Online publication date: 26-Sep-2023.
    • (2023) MVSym: Efficient symbiotic exploitation of HLS-kernel multi-versioning for collaborative CPU-FPGA cloud systems. Integration 93, 102052. DOI: 10.1016/j.vlsi.2023.102052. Online publication date: Dec-2023.
    • (2023) A machine learning-based resource-efficient task scheduler for heterogeneous computer systems. The Journal of Supercomputing 79, 14, 15700-15728. DOI: 10.1007/s11227-023-05266-4. Online publication date: 20-Apr-2023.
    • (2022) Hyperion. Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, 607-621. DOI: 10.1145/3560905.3568511. Online publication date: 6-Nov-2022.
    • (2022) Dopia. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 32-45. DOI: 10.1145/3503221.3508421. Online publication date: 2-Apr-2022.
    • (2022) GraphCL: A Framework for Execution of Data-Flow Graphs on Multi-Device Platforms. 2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), 116-121. DOI: 10.1109/PDP55904.2022.00026. Online publication date: Mar-2022.
    • (2022) Piper: Pipelining OpenMP Offloading Execution Through Compiler Optimization For Performance. 2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), 100-110. DOI: 10.1109/P3HPC56579.2022.00015. Online publication date: Dec-2022.
