research-article

Twin peaks: a software platform for heterogeneous computing on general-purpose and graphics processors

Authors:

Jayanth Gummaraju,

Laurent Morichetti,

Michael Houston,

Benedict R. Gaster,

Bixia Zheng

PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques

Pages 205 - 216

https://doi.org/10.1145/1854273.1854302

Published: 11 September 2010

Abstract

Modern processors are evolving into hybrid, heterogeneous processors with both CPU and GPU cores used for general purpose computation. Several languages such as Brook, CUDA, and more recently OpenCL are being developed to fully harness the potential of these processors. These languages typically involve the control code running on the CPU and the performance-critical, data-parallel kernel code running on the GPUs.

In this paper, we present Twin Peaks, a software platform for heterogeneous computing that executes code originally targeted for GPUs efficiently on CPUs as well. This permits a more balanced execution between the CPU and GPU, and enables portability of code between these architectures and to CPU-only environments. We propose several techniques in the runtime system to efficiently utilize the caches and functional units present in CPUs. Using OpenCL as a canonical language for heterogeneous computing, and running several experiments on real hardware, we show that our techniques enable GPGPU-style code to execute efficiently on multicore CPUs with minimal runtime overhead. These results also show that for maximum performance, it is beneficial for applications to utilize both CPUs and GPUs as accelerator targets.
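
As a rough illustration of the execution model the abstract refers to, the sketch below runs a GPGPU-style kernel (one invocation per work-item) on CPU threads by assigning whole work-groups to a thread pool. It is not the Twin Peaks runtime and does not reproduce the paper's cache or functional-unit techniques. The names `vec_add_kernel`, `run_work_group`, and `launch_on_cpu` are hypothetical, and Python threads are used only to show the work partitioning, not to obtain a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def vec_add_kernel(global_id, a, b, out):
    """Kernel body: called once per work-item, indexed by its global id."""
    out[global_id] = a[global_id] + b[global_id]


def run_work_group(kernel, group_id, local_size, global_size, args):
    """Run every work-item of one work-group serially on the calling thread."""
    start = group_id * local_size
    end = min(start + local_size, global_size)  # last group may be partial
    for global_id in range(start, end):
        kernel(global_id, *args)


def launch_on_cpu(kernel, global_size, local_size, args, num_threads=4):
    """Hypothetical launcher: hand out work-groups to a pool of CPU threads."""
    num_groups = (global_size + local_size - 1) // local_size
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [
            pool.submit(run_work_group, kernel, g, local_size, global_size, args)
            for g in range(num_groups)
        ]
        for f in futures:
            f.result()  # re-raise any exception from a worker thread


a = list(range(10))
b = [10 * x for x in a]
out = [0] * len(a)
launch_on_cpu(vec_add_kernel, global_size=len(a), local_size=4, args=(a, b, out))
print(out)
```

Running the work-items of a group back to back on one thread keeps neighbouring accesses on the same core, which is one plausible reason a CPU runtime might schedule at work-group granularity; whether Twin Peaks does exactly this is an assumption here, not something the abstract states.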


Index Terms

1. Computing methodologies
  1. Concurrent computing methodologies
    1. Concurrent programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        1. Concurrent programming languages

Recommendations

On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing
SAAHPC '11: Proceedings of the 2011 Symposium on Application Accelerators in High-Performance Computing

The graphics processing unit (GPU) has made significant strides as an accelerator in parallel computing. However, because the GPU has resided out on PCIe as a discrete device, the performance of GPU applications can be bottlenecked by data transfers ...

Fluidic Kernels: Cooperative Execution of OpenCL Programs on Multiple Heterogeneous Devices
CGO '14: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization

Programming heterogeneous computing systems with Graphics Processing Units (GPU) and multi-core CPUs in them is complex and time-consuming. OpenCL has emerged as an attractive programming framework for heterogeneous systems. But utilizing multiple ...


Published In


PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques

September 2010

596 pages

ISBN: 9781450301787

DOI: 10.1145/1854273

General Chair: Valentina Salapura (IBM TJ Watson Research Center)

Program Chairs: Michael Gschwind (IBM Systems & Technology Group), Jens Knoop (Technische Universität Wien)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

IFIP WG 10.3: IFIP working group 10.3 on concurrent systems
IEEE CS TCPP: IEEE-CS technical committee on parallel processing
SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE CS TCAA: IEEE CS technical committee on architectural acoustics

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

PACT '10

PACT '10: International Conference on Parallel Architectures and Compilation Techniques

September 11 - 15, 2010

Vienna, Austria

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%


Article Metrics

Total Citations: 73
Total Downloads: 1,334
Downloads (Last 12 months): 12
Downloads (Last 6 weeks): 2

Reflects downloads up to 10 Aug 2024


Cited By

Han R, Lee J, Sim J, Kim H (2022) COX: Exposing CUDA Warp-level Functions to CPUs. ACM Transactions on Architecture and Code Optimization 19:4 (1-25). Online publication date: 16-Sep-2022
https://dl.acm.org/doi/10.1145/3554736
Nozal R, Bosque J (2022) Mashing load balancing algorithm to boost hybrid kernels in molecular dynamics simulations. The Journal of Supercomputing 79:1 (1065-1080). Online publication date: 21-Jul-2022
https://doi.org/10.1007/s11227-022-04671-5
Nozal R, Niethammer C, Gracia J, Bosque J (2022) Feasibility Study of Molecular Dynamics Kernels Exploitation Using EngineCL. Euro-Par 2021: Parallel Processing Workshops (129-140). Online publication date: 9-Jun-2022
https://doi.org/10.1007/978-3-031-06156-1_11
Cho Y, Negele F, Park S, Egger B, Gross T, Evripidou S, Stenström P, O'Boyle M (2018) On-the-fly workload partitioning for integrated CPU/GPU architectures. Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (1-13). Online publication date: 1-Nov-2018
https://dl.acm.org/doi/10.1145/3243176.3243210
Zhang P, Fang J, Yang C, Tang T, Huang C, Wang Z, Kaeli D, Pericàs M (2018) MOCL. Proceedings of the 15th ACM International Conference on Computing Frontiers (26-35). Online publication date: 8-May-2018
https://dl.acm.org/doi/10.1145/3203217.3203244
Chen K, Chen C (2018) Enabling SIMT Execution Model on Homogeneous Multi-Core System. ACM Transactions on Architecture and Code Optimization 15:1 (1-26). Online publication date: 22-Mar-2018
https://dl.acm.org/doi/10.1145/3177960
K R, Chiplunkar N (2018) A survey on techniques for cooperative CPU-GPU computing. Sustainable Computing: Informatics and Systems 19 (72-85). Online publication date: Sep-2018
https://doi.org/10.1016/j.suscom.2018.07.010
Wang H, Guan X, Wu H (2017) A Hybrid Parallel Spatial Interpolation Algorithm for Massive LiDAR Point Clouds on Heterogeneous CPU-GPU Systems. ISPRS International Journal of Geo-Information 6:11 (363). Online publication date: 16-Nov-2017
https://doi.org/10.3390/ijgi6110363
Fang J, Zhang P, Tang T, Huang C, Yang C (2017) Implementing and Evaluating OpenCL on an ARMv8 Multi-Core CPU. 2017 IEEE International Symposium on Parallel and Distributed Processing with Applications and 2017 IEEE International Conference on Ubiquitous Computing and Communications (ISPA/IUCC) (860-867). Online publication date: Dec-2017
https://doi.org/10.1109/ISPA/IUCC.2017.00131
Chang L, Hajj I, Rodrigues C, Gómez-Luna J, Hwu W, Hsu W, Yang C, Lipasti M, Lee H (2016) Efficient kernel synthesis for performance portable programming. The 49th Annual IEEE/ACM International Symposium on Microarchitecture (1-13). Online publication date: 15-Oct-2016
https://dl.acm.org/doi/10.5555/3195638.3195653
