Stratix 10 NX Architecture

Published: 08 August 2022

Abstract

The advent of AI has driven the exploration of high-density, low-precision arithmetic on FPGAs. This has resulted in new methods for mapping both arithmetic functions and dataflows onto the fabric, as well as changes to the embedded DSP Blocks. Technologies outside the FPGA realm have also evolved, such as the addition of tensor structures to GPUs and the introduction of numerous AI ASSPs, all of which claim higher performance and efficiency than current FPGAs. In this article, we introduce the Stratix 10 NX device, a variant of the Stratix 10 FPGA specifically optimized for the AI application space. In addition to the computational capabilities of the standard programmable soft-logic fabric, a new type of DSP Block provides the dense arrays of low-precision multipliers typically used in AI implementations. The architecture of the block is tuned for the matrix-matrix and vector-matrix multiplications common in AI, with capabilities designed to work efficiently for both small and large matrix sizes. The base precisions are INT8 and INT4, along with shared-exponent support for block FP16 and block FP12 numerics. All additions/accumulations can be performed in INT32 or IEEE-754 single-precision floating point (FP32), and multiple blocks can be cascaded together to support larger matrices. We also describe methods by which the smaller-precision multipliers can be aggregated to create larger multipliers that are more applicable to standard signal-processing requirements; a sketch of the underlying arithmetic appears below.
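To make the shared-exponent numerics concrete, here is a minimal Python sketch of a block FP16-style dot product: each block of values shares a single exponent, per-element mantissas are quantized to INT8, products are accumulated as integers within a block, and the result is scaled into FP32. The block size of 10 and the rounding scheme here are illustrative assumptions, not the exact hardware behavior.

```python
import numpy as np

def block_quantize(x, block_size=10, mant_bits=8):
    """Split x into blocks that share one exponent (block FP16-style):
    per-element signed mantissas, one power-of-two scale per block."""
    pad = (-len(x)) % block_size
    x = np.pad(np.asarray(x, dtype=np.float64), (0, pad))
    blocks = x.reshape(-1, block_size)
    max_mag = np.max(np.abs(blocks), axis=1, keepdims=True)
    # Shared exponent chosen so the largest value in each block fits.
    exp = np.where(max_mag > 0, np.ceil(np.log2(max_mag + 1e-300)), 0.0)
    scale = 2.0 ** (exp - (mant_bits - 1))
    lo, hi = -(2 ** (mant_bits - 1)), 2 ** (mant_bits - 1) - 1
    mant = np.clip(np.round(blocks / scale), lo, hi).astype(np.int32)
    return mant, scale[:, 0]

def block_fp_dot(a, b, block_size=10):
    """Dot product: INT8 mantissa multiplies, integer accumulation
    within each block, FP32 scaling and summation across blocks."""
    ma, sa = block_quantize(a, block_size)
    mb, sb = block_quantize(b, block_size)
    partial = np.sum(ma * mb, axis=1, dtype=np.int64)   # integer MACs
    return np.float32(np.sum(partial * sa * sb))        # scale into FP32

rng = np.random.default_rng(0)
a, b = rng.standard_normal(64), rng.standard_normal(64)
print(block_fp_dot(a, b), float(a @ b))   # close, up to quantization
```

The aggregation of small multipliers into larger ones rests on a standard decomposition, shown below for a 16-bit signed multiply built from four byte-level partial products. The paper's block-level aggregation techniques are more involved; this only illustrates the arithmetic identity they rely on.

```python
def mul16_from_bytes(a, b):
    """Compose a 16-bit signed multiply from four byte-level partial
    products (schoolbook decomposition): a = ah*256 + al, likewise b."""
    ah, al = a >> 8, a & 0xFF    # signed high part, unsigned low part
    bh, bl = b >> 8, b & 0xFF
    return (ah * bh << 16) + ((ah * bl + al * bh) << 8) + al * bl

assert mul16_from_bytes(-12345, 321) == -12345 * 321
assert mul16_from_bytes(32767, -32768) == 32767 * -32768
```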
In the AI market, the FPGA must compete directly with other types of devices rather than occupy a unique niche. Deterministic system performance is as important as the performance of individual FPGA elements such as logic, memory, and DSP. We show that the feed-forward datapath structures needed to support the typical AI matrix-vector and matrix-matrix multiplication operations can consistently close timing at over 500 MHz on a mid-speed-grade device, even when all of the Tensor Blocks on the device are used. We also show a full-chip NPU processor implementation that outperforms GPUs at the same process node for a variety of AI inference workloads, even though it has a lower operating frequency of 365 MHz.
In terms of overall compute throughput, Stratix 10 NX is specified at 143 INT8/FP16 TOPs/TFLOPs or 286 INT4/FP12 TOPs/TFLOPs. Depending on the configuration, power efficiency is in the range of 1–4 TOPs/W or TFLOPs/W.
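The headline throughput figures can be reproduced with simple arithmetic. The sketch below is a back-of-envelope check under assumed device parameters (roughly 3,960 Tensor Blocks, 30 INT8 multipliers per block, 600 MHz clock) drawn from public Stratix 10 NX material rather than from this abstract, so treat the counts as assumptions.

```python
# Back-of-envelope check of the peak-throughput figures. The device
# counts below are assumptions, not values stated in the abstract.
tensor_blocks = 3960          # AI Tensor Blocks on the device (assumed)
int8_macs_per_block = 30      # INT8 multipliers per block (assumed)
ops_per_mac = 2               # one multiply + one add per MAC
clock_hz = 600e6              # assumed peak DSP clock

int8_tops = tensor_blocks * int8_macs_per_block * ops_per_mac * clock_hz / 1e12
print(f"INT8/FP16 peak: {int8_tops:.0f} TOPs")       # -> 143
print(f"INT4/FP12 peak: {2 * int8_tops:.0f} TOPs")   # ~285; quoted as 286 = 2 x 143
```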


Cited By

  • Efficient 8-bit Matrix Multiplication on Intel Agilex-5 FPGAs. In 2024 IEEE 32nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 43–53. DOI: 10.1109/FCCM60383.2024.00016 (May 2024)
  • Challenges and Opportunities to Enable Large-Scale Computing via Heterogeneous Chiplets. In Proceedings of the 29th Asia and South Pacific Design Automation Conference (ASP-DAC), 765–770. DOI: 10.1109/ASP-DAC58780.2024.10473961 (January 2024)
  • BRAMAC: Compute-in-BRAM Architectures for Multiply-Accumulate on FPGAs. In 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 52–62. DOI: 10.1109/FCCM57271.2023.00015 (May 2023)
  • Trade-Off-Oriented Impedance Optimization of Chiplet-Based 2.5-D Integrated Circuits With a Hybrid MDP Algorithm for Noise Elimination. IEEE Transactions on Circuits and Systems I: Regular Papers 69(12), 5247–5258. DOI: 10.1109/TCSI.2022.3200410 (December 2022)
  • HPIPE NX: Boosting CNN Inference Acceleration Performance with AI-Optimized FPGAs. In 2022 International Conference on Field-Programmable Technology (ICFPT), 1–9. DOI: 10.1109/ICFPT56656.2022.9974441 (December 2022)

Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 15, Issue 4
December 2022, 476 pages
ISSN: 1936-7406
EISSN: 1936-7414
DOI: 10.1145/3540252
Editor: Deming Chen

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 08 August 2022
Online AM: 14 March 2022
Accepted: 01 February 2022
Revised: 01 January 2022
Received: 01 September 2021
Published in TRETS Volume 15, Issue 4


Author Tags

  1. FPGA architecture
  2. AI tensor block
  3. FPGA accelerator
  4. place and route

Qualifiers

  • Research-article
  • Refereed

