Efficient compilation of CUDA kernels for high-performance computing on FPGAs

Published: 30 September 2013

Abstract

The rise of multicore architectures across all computing domains has opened the door to heterogeneous multiprocessors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs, in particular, are becoming very popular for speeding up compute-intensive kernels of scientific, imaging, and simulation applications. New programming models that facilitate parallel processing on heterogeneous systems containing GPUs are spreading rapidly in the computing community. By leveraging these investments, the developers of other accelerators have an opportunity to significantly reduce the programming effort by supporting those accelerator models already gaining popularity. In this work, we adapt one such language, the CUDA programming model, into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool (available from Xilinx) which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SIMT (Single Instruction, Multiple Thread) CUDA code into task-level parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multicore accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.
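The SIMT-to-task-level transformation described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `vec_add` kernel, the function names `vec_add_block` and `vec_add_grid`, and the loop structure are hypothetical and are not taken from FCUDA's actual output. The sketch only shows the general idea of making the implicit CUDA threads explicit as a loop inside a per-block C function, so that each thread block becomes an independent task a high-level synthesis tool could replicate or pipeline.

```c
/* Hypothetical CUDA kernel used as the starting point (not from the paper):
 *
 *   __global__ void vec_add(const float *a, const float *b, float *c, int n) {
 *       int i = blockIdx.x * blockDim.x + threadIdx.x;
 *       if (i < n) c[i] = a[i] + b[i];
 *   }
 */

/* One thread block rewritten as a plain C task. The implicit SIMT threads
 * become an explicit loop over thread_idx; an HLS tool could unroll or
 * pipeline this loop to recover the fine-grained parallelism. */
static void vec_add_block(const float *a, const float *b, float *c, int n,
                          int block_idx, int block_dim)
{
    for (int thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
        int i = block_idx * block_dim + thread_idx;
        if (i < n)              /* same bounds guard as the kernel */
            c[i] = a[i] + b[i];
    }
}

/* The grid becomes a loop over block tasks. The tasks do not depend on one
 * another, so they could be mapped onto parallel hardware cores
 * (coarse-grained parallelism). Here they simply run one after another. */
static void vec_add_grid(const float *a, const float *b, float *c, int n,
                         int block_dim)
{
    int grid_dim = (n + block_dim - 1) / block_dim;  /* ceiling division */
    for (int block_idx = 0; block_idx < grid_dim; ++block_idx)
        vec_add_block(a, b, c, n, block_idx, block_dim);
}
```

Calling `vec_add_grid(a, b, c, n, 2)` on arrays of length 5 runs three block tasks, and the bounds guard skips the one out-of-range index in the last block. A real flow would also have to handle synchronization points such as `__syncthreads()` and shared-memory transfers, which this sketch leaves out.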

Published In

ACM Transactions on Embedded Computing Systems, Volume 13, Issue 2
Special issue on application-specific processors
September 2013
254 pages
ISSN: 1539-9087
EISSN: 1558-3465
DOI: 10.1145/2514641

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 30 September 2013
Accepted: 01 August 2012
Revised: 01 February 2012
Received: 01 March 2011
Published in TECS Volume 13, Issue 2

Author Tags

1. FPGA
2. heterogeneous compute systems
3. high-level synthesis
4. high-performance computing
5. parallel programming model
6. source-to-source compiler

Qualifiers

• Research-article
• Research
• Refereed
