Efficient compilation of CUDA kernels for high-performance computing on FPGAs

Authors:

Alexandros Papakonstantinou,

Karthik Gururaj,

John A. Stratton,

Wen-Mei W. Hwu

ACM Transactions on Embedded Computing Systems (TECS), Volume 13, Issue 2

Article No.: 25, Pages 1 - 26

https://doi.org/10.1145/2514641.2514652

Published: 30 September 2013

Abstract

The rise of multicore architectures across all computing domains has opened the door to heterogeneous multiprocessors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs, in particular, are becoming very popular for speeding up compute-intensive kernels of scientific, imaging, and simulation applications. New programming models that facilitate parallel processing on heterogeneous systems containing GPUs are spreading rapidly in the computing community. By leveraging these investments, the developers of other accelerators have an opportunity to significantly reduce the programming effort by supporting those accelerator models already gaining popularity. In this work, we adapt one such language, the CUDA programming model, into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool (available from Xilinx) which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SIMT (Single Instruction, Multiple Thread) CUDA code into task-level parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multicore accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.
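
To make the SIMT-to-C idea in the abstract concrete, the following is a minimal sketch, not FCUDA's actual output. It assumes a simple vector-add kernel and shows one common way to serialize CUDA threads: each thread-block becomes a C function whose body loops over the thread index, and the grid becomes an outer loop over independent block-level tasks. The names `vec_add_block`, `vec_add_grid`, and `BLOCK_DIM` are hypothetical, and no AutoPilot directives are shown because the abstract does not specify them.

```c
#include <stddef.h>

/* Original CUDA kernel (for reference only, not compiled here):
 *   __global__ void vec_add(const float *a, const float *b, float *c, int n) {
 *       int i = blockIdx.x * blockDim.x + threadIdx.x;
 *       if (i < n) c[i] = a[i] + b[i];
 *   }
 */

/* Hypothetical fixed block size; in CUDA this would be blockDim.x. */
#define BLOCK_DIM 64

/* One thread-block as a plain C task: the implicit SIMT threads are
 * made explicit as a loop over thread_idx (fine-grained parallelism). */
void vec_add_block(const float *a, const float *b, float *c,
                   size_t n, size_t block_idx)
{
    for (size_t thread_idx = 0; thread_idx < BLOCK_DIM; ++thread_idx) {
        size_t i = block_idx * BLOCK_DIM + thread_idx;
        if (i < n) {            /* same bounds guard as the kernel */
            c[i] = a[i] + b[i];
        }
    }
}

/* The grid as a loop over blocks. Blocks touch disjoint elements, so
 * each iteration is an independent task (coarse-grained parallelism)
 * that a synthesis tool could, in principle, map to a separate core. */
void vec_add_grid(const float *a, const float *b, float *c, size_t n)
{
    size_t grid_dim = (n + BLOCK_DIM - 1) / BLOCK_DIM; /* ceil(n / BLOCK_DIM) */
    for (size_t block_idx = 0; block_idx < grid_dim; ++block_idx) {
        vec_add_block(a, b, c, n, block_idx);
    }
}
```

Calling `vec_add_grid(a, b, c, n)` on three arrays of length `n` computes the same result as launching the reference kernel with enough blocks to cover `n` elements. The sketch leaves out everything that makes a real flow efficient, such as on-chip buffering of off-chip data and handling of synchronization points inside the kernel.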

Index Terms

Computer systems organization

Published In

ACM Transactions on Embedded Computing Systems Volume 13, Issue 2

Special issue on application-specific processors

September 2013

254 pages

ISSN:1539-9087

EISSN:1558-3465

DOI:10.1145/2514641

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 30 September 2013

Accepted: 01 August 2012

Revised: 01 February 2012

Received: 01 March 2011

Published in TECS Volume 13, Issue 2

Qualifiers

Research-article
Research
Refereed

Article Metrics

Total citations: 20
Total downloads: 541
Downloads (last 12 months): 31
Downloads (last 6 weeks): 6

Reflects downloads up to 12 Nov 2024

Cited By

Krishnasamy E, Vasileska I, Kos L, Bouvry P. (2023). OpenMP Offloading and OpenACC Programming Model Approach for Object-Oriented Plasma Device Algorithms. 2023 46th MIPRO ICT and Electronics Convention (MIPRO), 326-330. Online publication date: 22-May-2023.
https://doi.org/10.23919/MIPRO57284.2023.10159738

Choudhury Z, Gulati A, Purini S. (2023). FlowPix: Accelerating Image Processing Pipelines on an FPGA Overlay using a Domain Specific Compiler. ACM Transactions on Architecture and Code Optimization 20:4, 1-25. Online publication date: 25-Oct-2023.
https://dl.acm.org/doi/10.1145/3629523

Samayoa W, Crespo M, Cicuttin A, Carrato S. (2023). A Survey on FPGA-Based Heterogeneous Clusters Architectures. IEEE Access 11, 67679-67706. Online publication date: 2023.
https://doi.org/10.1109/ACCESS.2023.3288431

Chamberlain R. (2020). Architecturally truly diverse systems: A review. Future Generation Computer Systems. Online publication date: Apr-2020.
https://doi.org/10.1016/j.future.2020.03.061

Campbell K, Lin D, He L, Yang L, Gurumani S, Rupnow K, Mitra S, Chen D. (2019). Hybrid Quick Error Detection: Validation and Debug of SoCs Through High-Level Synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 38:7, 1345-1358. Online publication date: Jul-2019.
https://doi.org/10.1109/TCAD.2018.2837103

Li B, Zhou Q, Si X. (2018). Mimic computing for password recovery. Future Generation Computer Systems 84, 58-77. Online publication date: Jul-2018.
https://doi.org/10.1016/j.future.2018.02.018

Li Z, Liu L, Deng Y, Yin S, Wang Y, Wei S. (2017). Aggressive Pipelining of Irregular Applications on Reconfigurable Hardware. ACM SIGARCH Computer Architecture News 45:2, 575-586. Online publication date: 24-Jun-2017.
https://dl.acm.org/doi/10.1145/3140659.3080228

Li Z, Liu L, Deng Y, Yin S, Wang Y, Wei S. (2017). Aggressive Pipelining of Irregular Applications on Reconfigurable Hardware. Proceedings of the 44th Annual International Symposium on Computer Architecture, 575-586. Online publication date: 24-Jun-2017.
https://dl.acm.org/doi/10.1145/3079856.3080228

Campbell K, Zuo W, Chen D. (2017). New advances of high-level synthesis for efficient and reliable hardware design. Integration 58, 189-214. Online publication date: Jun-2017.
https://doi.org/10.1016/j.vlsi.2016.11.006

(2017). Exploiting vectorization in high level synthesis of nested irregular loops. Journal of Systems Architecture: the EUROMICRO Journal 75:C, 1-14. Online publication date: 1-Apr-2017.
https://dl.acm.org/doi/10.1016/j.sysarc.2017.03.001
