Achieving a single compute device image in OpenCL for multiple GPUs

Authors:

Jaejin Lee

PPoPP '11: Proceedings of the 16th ACM symposium on Principles and practice of parallel programming

Pages 277 - 288

https://doi.org/10.1145/1941553.1941591

Published: 12 February 2011

Abstract

We propose an OpenCL framework that combines multiple GPUs and presents them to the user as a single compute device. This single virtual compute device image makes an OpenCL application written for a single GPU portable to a platform with multiple GPU devices, and lets the application exploit both the full computing power of those devices and the total amount of GPU memory available in the platform. At run time, the framework automatically distributes an OpenCL kernel written for a single GPU into multiple CUDA kernels that execute on the multiple GPU devices. It performs a sampling run to apply a run-time memory access range analysis to the kernel, and uses the result to identify an optimal workload distribution. To achieve the single compute device image, the runtime maintains virtual device memory allocated in main memory. The OpenCL runtime treats this memory as if it were the memory of a single GPU device and keeps it consistent with the memories of the multiple GPU devices. Our OpenCL-C-to-C translator generates the sampling code from the OpenCL kernel code, and our OpenCL-C-to-CUDA-C translator generates the CUDA kernel code for the distributed OpenCL kernel. We demonstrate the effectiveness of the framework by implementing the OpenCL runtime and the two source-to-source translators, and we evaluate its performance on a system with 8 GPUs using 11 OpenCL benchmark applications.
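The abstract does not spell out how the workload distribution or the access range analysis works, so the following is only a minimal Python sketch of the general idea, not the paper's algorithm. It assumes a one-dimensional index space, an even contiguous split of work-groups across devices, and a buffer index expression that is monotonic in the global work-item ID, so that evaluating it at the first and last work-item of a chunk bounds the range touched. The names `split_work_groups`, `sampled_access_range`, and `index_of` are hypothetical and chosen for illustration.

```python
from typing import Callable, List, Optional, Tuple


def split_work_groups(num_groups: int, num_devices: int) -> List[range]:
    """Split work-group IDs 0..num_groups-1 into contiguous, near-equal chunks.

    Assumption: an even split; a real runtime could weight chunks differently.
    """
    base, extra = divmod(num_groups, num_devices)
    chunks: List[range] = []
    start = 0
    for dev in range(num_devices):
        # The first `extra` devices take one additional work-group.
        size = base + (1 if dev < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def sampled_access_range(
    index_of: Callable[[int], int],
    groups: range,
    group_size: int,
) -> Optional[Tuple[int, int]]:
    """Bound the buffer indices touched by one chunk of work-groups.

    Only the first and last work-item of the chunk are evaluated, which is
    valid only under the stated assumption that `index_of` is monotonic in
    the global work-item ID.
    """
    if len(groups) == 0:
        return None  # this device received no work
    first_item = groups.start * group_size
    last_item = groups.stop * group_size - 1
    a, b = index_of(first_item), index_of(last_item)
    return (min(a, b), max(a, b))


if __name__ == "__main__":
    # Hypothetical kernel in which work-item g accesses buffer[2 * g + 1].
    chunks = split_work_groups(num_groups=10, num_devices=4)
    for dev, chunk in enumerate(chunks):
        print(dev, list(chunk), sampled_access_range(lambda g: 2 * g + 1, chunk, 64))
```

Under these assumptions, the per-device index range indicates which slice of the host-side virtual device memory would need to be copied to each GPU before its share of the kernel runs.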

Index Terms

Achieving a single compute device image in OpenCL for multiple GPUs
1. Computing methodologies
  1. Concurrent computing methodologies
    1. Concurrent programming languages
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Runtime environments
      2. Source code generation
    2. General programming languages
      1. Language types
        1. Concurrent programming languages

Published In

PPoPP '11: Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
February 2011, 326 pages
ISBN: 9781450301190
DOI: 10.1145/1941553

General Chair: Calin Cascaval, Qualcomm Research, USA
Program Chair: Pen-Chung Yew, Academia Sinica, Taiwan and University of Minnesota at Twin Cities, USA

ACM SIGPLAN Notices Volume 46, Issue 8 (PPoPP '11)
August 2011, 300 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2038037

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PPoPP '11: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 12 - 16, 2011
San Antonio, TX, USA
Sponsor: SIGPLAN

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%
