Accelerating Attention Mechanism on FPGAs based on Efficient Reconfigurable Systolic Array

Published: 09 November 2023
Abstract

    Transformer architectures have recently attracted great interest in natural language processing, machine translation, and computer vision, and attention mechanisms are their core building blocks. However, the attention mechanism is expensive because of its intensive matrix computations and complicated data flow. Existing hardware architectures handle the computing structure of attention poorly, suffering from inflexibility and low efficiency, and most prior work accelerates attention by reducing the amount of computation through pruning algorithms, which affects accuracy to varying degrees depending on the sparsity. This paper proposes a hardware accelerator for multi-head attention (MHA) on field-programmable gate arrays (FPGAs) built around a reconfigurable architecture, an efficient systolic array, and a hardware-friendly radix-2 softmax. We propose a novel Four-input Processing Element (FPE) that doubles the computation rate of the data-aware systolic array (SA) and keeps it efficient and load-balanced. In particular, the computation framework is carefully designed to keep the SA highly utilized. Our design is evaluated on a Xilinx Alveo U250 card; the proposed architecture achieves 51.3× and 17.3× latency improvements and 54.4× and 17.9× energy savings compared to a CPU and a GPU, respectively.
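    The abstract does not specify the FPE dataflow or the exact radix-2 softmax circuit, so the following is only a minimal software sketch of the computation such an accelerator targets: standard scaled dot-product multi-head attention, with the softmax exponential rewritten in radix-2 form (e^x = 2^(x·log2 e)) as a stand-in for the paper's hardware-friendly radix-2 softmax. All names and shapes are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumption, not the paper's design): scaled dot-product
# multi-head attention with the softmax exponential evaluated as a power of
# two, mirroring the kind of workload an MHA accelerator maps onto its
# systolic array.
import numpy as np

def radix2_softmax(x, axis=-1):
    # Subtract the row max for numerical stability, then evaluate e**x as
    # 2**(x * log2(e)); hardware can split that exponent into an integer part
    # (a shift) and a fractional part (a small lookup table).
    t = (x - np.max(x, axis=axis, keepdims=True)) * np.log2(np.e)
    p = np.exp2(t)
    return p / np.sum(p, axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    # X: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model)
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ Wq), split_heads(X @ Wk), split_heads(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    attn = radix2_softmax(scores, axis=-1)                 # row-wise softmax
    heads = attn @ V                                       # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo                                     # (seq_len, d_model)

# Example: 4 heads over a short sequence.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 8, 64, 4
X = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads).shape)  # (8, 64)
```

    Note that 2^(x·log2 e) equals e^x exactly, so this rewrite preserves the softmax result while allowing the exponential to be implemented with a shifter plus a small fractional lookup, which is the usual motivation for radix-2 softmax designs.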

    Published In

    ACM Transactions on Embedded Computing Systems, Volume 22, Issue 6
    November 2023
    428 pages
    ISSN: 1539-9087
    EISSN: 1558-3465
    DOI: 10.1145/3632298
    • Editor: Tulika Mitra

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 09 November 2023
    Online AM: 20 July 2022
    Accepted: 10 July 2022
    Revised: 19 May 2022
    Received: 31 December 2021
    Published in TECS Volume 22, Issue 6

    Author Tags

    1. Transformer
    2. attention
    3. FPGA
    4. reconfigurable systolic array
    5. softmax
    6. accelerator

    Qualifiers

    • Research-article

    Funding Sources

    • Key-Area Research and Development Program of Guangdong Province
    • Natural Science Foundation of Hunan Province
    • NSFC
    • Open Research Projects of Zhejiang Lab
    • Cultivation of Shenzhen Excellent Technological and Innovative Talents (Ph.D. Basic Research Started)
    • Basic Research of Shenzhen Science and Technology Plan

    Article Metrics

    • Downloads (Last 12 months): 1,528
    • Downloads (Last 6 weeks): 137
    Reflects downloads up to 26 Jul 2024

    Cited By

    • (2024) ReIPE: Recycling Idle PEs in CNN Accelerator for Vulnerable Filters Soft-Error Detection. ACM Transactions on Architecture and Code Optimization. DOI: 10.1145/3674909. Online publication date: 28-Jun-2024
    • (2024) Enhancing Long Sequence Input Processing in FPGA-Based Transformer Accelerators through Attention Fusion. Proceedings of the Great Lakes Symposium on VLSI 2024. DOI: 10.1145/3649476.3658810, pp. 599-603. Online publication date: 12-Jun-2024
    • (2024) A Case for Low Bitwidth Floating Point Arithmetic on FPGA for Transformer Based DNN Inference. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). DOI: 10.1109/IPDPSW63119.2024.00045, pp. 178-185. Online publication date: 27-May-2024
    • (2023) High-Frequency Systolic Array-Based Transformer Accelerator on Field Programmable Gate Arrays. Electronics. DOI: 10.3390/electronics12040822, 12:4 (822). Online publication date: 6-Feb-2023
    • (2023) MaxEVA: Maximizing the Efficiency of Matrix Multiplication on Versal AI Engine. 2023 International Conference on Field Programmable Technology (ICFPT). DOI: 10.1109/ICFPT59805.2023.00016, pp. 96-105. Online publication date: 12-Dec-2023
    • (2023) FPGA Accelerating Multi-Source Transfer Learning with GAT for Bioactivities of Ligands Targeting Orphan G Protein-Coupled Receptors. 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL). DOI: 10.1109/FPL60245.2023.00054, pp. 317-321. Online publication date: 4-Sep-2023
    • (2022) Accelerating Transformer Neural Networks on FPGAs for High Energy Physics Experiments. 2022 International Conference on Field-Programmable Technology (ICFPT). DOI: 10.1109/ICFPT56656.2022.9974463, pp. 1-8. Online publication date: 5-Dec-2022
