DOI: 10.5555/2523721.2523756
research-article

Transparent CPU-GPU collaboration for data-parallel kernels on heterogeneous systems

Published: 07 October 2013

Abstract

Heterogeneous computing on CPUs and GPUs has traditionally used fixed roles for each device: the GPU handles data-parallel work by taking advantage of its massive number of cores, while the CPU handles non-data-parallel work, such as sequential code or data transfer management. Unfortunately, this work distribution can be a poor solution: it underutilizes the CPU, has difficulty generalizing beyond a single CPU-GPU combination, and may waste a large fraction of time transferring data. Further, CPUs are performance-competitive with GPUs on many workloads, so simply partitioning work based on the fixed roles may be a poor choice. In this paper, we present the single kernel multiple devices (SKMD) system, a framework that transparently orchestrates collaborative execution of a single data-parallel kernel across multiple asymmetric CPUs and GPUs. The programmer is responsible for developing a single data-parallel kernel in OpenCL, while the system automatically partitions the workload across an arbitrary set of devices, generates kernels to execute the partial workloads, and efficiently merges the partial outputs together. The goal is performance improvement by maximally utilizing all available resources to execute the kernel. SKMD handles the difficult challenges of exposed data transfer costs and the performance variations GPUs have with respect to input size. On real hardware, SKMD achieves an average speedup of 29% on a system with one multicore CPU and two asymmetric GPUs, compared to a fastest-device execution strategy, for a set of popular OpenCL kernels.
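
The partition, execute, and merge flow described in the abstract can be illustrated with a toy sketch in plain Python. This is not SKMD's algorithm: the abstract does not give the partitioning policy, and the sketch ignores the data transfer costs and input-size-dependent GPU performance that SKMD is said to handle. The function names, the device names, and the per-device throughput figures are all hypothetical, and the "devices" are simulated by ordinary function calls.

```python
from typing import Callable, Dict, List, Tuple


def partition_work_groups(total: int, throughput: Dict[str, float]) -> Dict[str, Tuple[int, int]]:
    """Split work-group indices [0, total) into contiguous ranges, one per device,
    sized in proportion to an ASSUMED relative throughput for each device."""
    if total < 0:
        raise ValueError("total must be non-negative")
    if not throughput or any(rate <= 0 for rate in throughput.values()):
        raise ValueError("throughput must be non-empty and positive")
    total_rate = sum(throughput.values())
    devices = list(throughput)
    ranges: Dict[str, Tuple[int, int]] = {}
    start = 0
    cumulative = 0.0
    for i, dev in enumerate(devices):
        cumulative += throughput[dev]
        # The last device takes whatever remains so that every work-group is covered.
        end = total if i == len(devices) - 1 else round(total * cumulative / total_rate)
        ranges[dev] = (start, end)
        start = end
    return ranges


def run_partial(kernel: Callable[[int], int], data: List[int],
                group_size: int, group_range: Tuple[int, int]) -> Tuple[int, List[int]]:
    """Simulate one device running the kernel on its share of the work-groups.
    Returns the element offset of the partial output and the partial output itself."""
    first, last = group_range
    lo = first * group_size
    hi = min(last * group_size, len(data))
    return lo, [kernel(x) for x in data[lo:hi]]


def merge(n: int, partials: List[Tuple[int, List[int]]]) -> List[int]:
    """Copy each partial output into its place in the full-size output buffer."""
    out: List[int] = [0] * n
    for offset, values in partials:
        out[offset:offset + len(values)] = values
    return out


# Hypothetical setup: 10 elements, 2 elements per work-group, three devices.
data = list(range(10))
group_size = 2
num_groups = (len(data) + group_size - 1) // group_size
rates = {"cpu": 1.0, "gpu0": 3.0, "gpu1": 1.0}  # made-up relative throughputs

ranges = partition_work_groups(num_groups, rates)
partials = [run_partial(lambda x: x * x, data, group_size, r) for r in ranges.values()]
result = merge(len(data), partials)
```

Contiguous ranges keep the merge step trivial here because each partial output occupies a disjoint slice of the result. A real system would also have to weigh the cost of moving inputs and outputs to and from each device, which this sketch leaves out.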



    Published In

    PACT '13: Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
    October 2013
    422 pages
    ISBN:9781479910212

    Publisher

    IEEE Press

    Author Tags

    1. GPGPU
    2. collaboration
    3. data parallel
    4. openCL

    Acceptance Rates

    PACT '13 Paper Acceptance Rate 36 of 208 submissions, 17%;
    Overall Acceptance Rate 121 of 471 submissions, 26%
