DOI: 10.1145/3289602.3293988
Poster · Public Access

XFER: A Novel Design to Achieve Super-Linear Performance on Multiple FPGAs for Real-Time AI

Published: 20 February 2019

Abstract

Real-time inference with low-latency requirements has become increasingly important for numerous applications in both cloud and edge computing. FPGA-based Deep Neural Network (DNN) accelerators have demonstrated superior performance and energy efficiency over CPUs and GPUs; moreover, for real-time AI with small batch sizes, FPGAs are expected to deliver further performance improvements over general-purpose computing platforms. However, the performance gain of a single-FPGA design is hindered by limited on-chip resources. In this paper, we leverage a cluster of FPGAs to fully exploit the parallelism in DNNs with the objective of obtaining super-linear performance. To achieve this goal, we propose a novel design, "XFER", which deploys DNNs onto an FPGA cluster by splitting each DNN layer across multiple FPGAs and moving traffic from the memory bus to inter-FPGA links. The resulting system achieves both workload balance and traffic balance. As a case study, we implement Convolutional Neural Networks (CNNs) on ZCU102 FPGA boards. Evaluation results demonstrate that XFER on two FPGAs achieves a 3.48x speedup over state-of-the-art single-FPGA designs, i.e., super-linear speedup.
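
The layer-splitting idea described in the abstract can be illustrated with a minimal NumPy sketch. This is only an illustrative sketch, not the XFER implementation: the two "FPGAs" below are plain function calls, the output-channel split is just one possible way to partition a layer, and the names (conv2d, split_layer_across_two_devices) are hypothetical. The point is that each device needs only its own slice of the weights, and only the finished output slices would need to cross the inter-FPGA link.

    import numpy as np

    def conv2d(x, w):
        """Valid cross-correlation as used in CNN layers.
        x: input feature map, shape (C_in, H, W)
        w: weights, shape (C_out, C_in, K, K)
        returns: output feature map, shape (C_out, H-K+1, W-K+1)"""
        c_out, c_in, k, _ = w.shape
        out_h, out_w = x.shape[1] - k + 1, x.shape[2] - k + 1
        y = np.zeros((c_out, out_h, out_w))
        for o in range(c_out):
            for i in range(out_h):
                for j in range(out_w):
                    y[o, i, j] = np.sum(x[:, i:i + k, j:j + k] * w[o])
        return y

    def split_layer_across_two_devices(x, w):
        """Hypothetical intra-layer split: each device computes a disjoint
        slice of the output channels, so each weight slice stays local to
        one device and only the two output slices travel over the link."""
        half = w.shape[0] // 2
        y0 = conv2d(x, w[:half])   # would run on FPGA 0
        y1 = conv2d(x, w[half:])   # would run on FPGA 1
        return np.concatenate([y0, y1], axis=0)  # gathered over the inter-FPGA link

    # The split layer produces exactly the same result as the unsplit layer.
    x = np.random.rand(3, 8, 8)
    w = np.random.rand(4, 3, 3, 3)
    assert np.allclose(split_layer_across_two_devices(x, w), conv2d(x, w))

Under this kind of split, each device holds half the weights and produces half the output channels, so doubling the devices also doubles the usable on-chip resources; the abstract's 3.48x speedup on two FPGAs exceeds the 2x that linear scaling would predict, which is what the paper calls super-linear speedup.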




Published In

FPGA '19: Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
February 2019, 360 pages
ISBN: 9781450361378
DOI: 10.1145/3289602

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. multi-fpga cluster
  2. real-time inference
  3. super-linear performance

Qualifiers

  • Poster


Acceptance Rates

Overall Acceptance Rate 125 of 627 submissions, 20%


Cited By

  • (2022) Hardware-friendly compression and hardware acceleration for transformer: A survey. Electronic Research Archive, 30(10):3755-3785. DOI: 10.3934/era.2022192
  • (2021) Co-Exploration of Graph Neural Network and Network-on-Chip Design Using AutoML. Proceedings of the 2021 Great Lakes Symposium on VLSI, 175-180. DOI: 10.1145/3453688.3461741
  • (2021) Accommodating Transformer onto FPGA. Proceedings of the 2021 Great Lakes Symposium on VLSI, 163-168. DOI: 10.1145/3453688.3461739
  • (2021) Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization. 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD), 1-9. DOI: 10.1109/ICCAD51958.2021.9643586
  • (2021) Energy-Efficient Deep Neural Networks Implementation on a Scalable Heterogeneous FPGA Cluster. 2021 IEEE 15th International Conference on Anti-counterfeiting, Security, and Identification (ASID), 10-15. DOI: 10.1109/ASID52932.2021.9651719
  • (2020) ConfuciuX: Autonomous Hardware Resource Assignment for DNN Accelerators using Reinforcement Learning. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 622-636. DOI: 10.1109/MICRO50266.2020.00058
  • (2019) When Neural Architecture Search Meets Hardware Implementation: from Hardware Awareness to Co-Design. 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 25-30. DOI: 10.1109/ISVLSI.2019.00014
  • (2019) A Module-Level Pipeline Implementation Based on Inter-Board Heterogeneous. 2019 IEEE 4th International Conference on Integrated Circuits and Microsystems (ICICM), 280-286. DOI: 10.1109/ICICM48536.2019.8977153
  • (2019) Machine Vision Guided 3D Medical Image Compression for Efficient Transmission and Accurate Segmentation in the Clouds. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12679-12688. DOI: 10.1109/CVPR.2019.01297
  • (2020) Co-Exploring Neural Architecture and Network-on-Chip Design for Real-Time Artificial Intelligence. 2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC), 85-90. DOI: 10.1109/ASP-DAC47756.2020.9045595
