
Specializing FGPU for Persistent Deep Learning

Published: 15 July 2021

Abstract

Overlay architectures are a good way to enable fast development and debug on FPGAs at the expense of potentially limited performance compared to fully customized FPGA designs. When used in concert with hand-tuned FPGA solutions, performant overlay architectures can improve time-to-solution and thus overall productivity of FPGA solutions. This work tunes and specializes FGPU, an open source OpenCL-programmable GPU overlay for FPGAs. We demonstrate that our persistent deep learning (PDL)-FGPU architecture maintains the ease-of-programming and generality of GPU programming while achieving high performance from specialization for the persistent deep learning domain. We also propose an easy method to specialize for other domains. PDL-FGPU includes new instructions, along with micro-architecture and compiler enhancements. We evaluate both the FGPU baseline and the proposed PDL-FGPU on a modern high-end Intel Stratix 10 2800 FPGA in simulation running persistent DL applications (RNN, GRU, LSTM), and non-DL applications to demonstrate generality. PDL-FGPU requires 1.4–3× more ALMs, 4.4–6.4× more M20ks, and 1–9.5× more DSPs than baseline, but improves performance by 56–693× for PDL applications with an average 23.1% degradation on non-PDL applications. We integrated the PDL-FGPU overlay into Intel OPAE to measure real-world performance/power and demonstrate that PDL-FGPU is only 4.0–10.4× slower than the Nvidia V100.
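The "persistent" approach the abstract targets keeps the recurrent weight matrices resident on-chip across all timesteps, so each step only streams the small input and hidden-state vectors. The sketch below is an illustrative vanilla-RNN recurrence in plain Python to show which operands are reused every iteration; it is not code from the paper or from FGPU's OpenCL kernels.

```python
import math

def matvec(W, v):
    """Dense matrix-vector product over plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def persistent_rnn(W_x, W_h, b, xs, h0):
    """Run a vanilla RNN over the sequence xs.

    The point a persistent accelerator exploits: W_x and W_h are loaded
    once and reused at every timestep, so keeping them resident on-chip
    removes the dominant off-chip memory traffic.
    """
    h = h0
    for x_t in xs:
        pre = [a + c + d for a, c, d in zip(matvec(W_x, x_t),
                                            matvec(W_h, h), b)]
        h = [math.tanh(p) for p in pre]  # h_t = tanh(W_x x_t + W_h h_{t-1} + b)
    return h

# Tiny illustrative run (hidden size 2, input size 1, one timestep)
W_x = [[1.0], [0.0]]
W_h = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
h = persistent_rnn(W_x, W_h, b, xs=[[1.0]], h0=[0.0, 0.0])
print(h)  # [tanh(1.0), 0.0]
```

GRU and LSTM cells replace the single `tanh` update with gated updates, but share the same structure: large, fixed weight matrices reused at every timestep, which is what makes on-chip weight residency pay off.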




Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 14, Issue 2
June 2021
107 pages
ISSN: 1936-7406
EISSN: 1936-7414
DOI: 10.1145/3468069
  • Editor: Deming Chen
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 July 2021
Accepted: 01 March 2021
Revised: 01 November 2020
Received: 01 December 2019
Published in TRETS Volume 14, Issue 2


Author Tags

  1. Overlay
  2. specialization
  3. FPGA
  4. GPU
  5. soft GPU
  6. persistent deep learning
  7. RNN

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • National Science Foundation


Article Metrics

  • Downloads (last 12 months): 38
  • Downloads (last 6 weeks): 4
Reflects downloads up to 16 Oct 2024

Cited By

  • (2023) ROSETTA: A Resource and Energy-Efficient Inference Processor for Recurrent Neural Networks Based on Programmable Data Formats and Fine Activation Pruning. IEEE Transactions on Emerging Topics in Computing 11, 3 (Jul. 2023), 650–663. DOI: 10.1109/TETC.2022.3230961
  • (2022) G-GPU: A Fully-Automated Generator of GPU-like ASIC Accelerators. In 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), 544–547. DOI: 10.23919/DATE54114.2022.9774758
  • (2022) Remarn: A Reconfigurable Multi-threaded Multi-core Accelerator for Recurrent Neural Networks. ACM Transactions on Reconfigurable Technology and Systems 16, 1 (Dec. 2022), 1–26. DOI: 10.1145/3534969
  • (2022) Accelerating legacy applications with spatial computing devices. The Journal of Supercomputing 79, 7 (Nov. 2022), 7461–7483. DOI: 10.1007/s11227-022-04925-2
  • (2021) Genetic Algorithm based Edge Computing Scheduling Strategy. In 2021 4th International Conference on Data Science and Information Technology, 130–134. DOI: 10.1145/3478905.3478932
