
Accelerating Attention Mechanism on FPGAs based on Efficient Reconfigurable Systolic Array

Published: 09 November 2023

Abstract

Transformer model architectures have recently received great interest in natural language processing, machine translation, and computer vision, where attention mechanisms are their core building blocks. However, the attention mechanism is expensive because of its intensive matrix computations and complicated data flow. Existing hardware architectures are poorly matched to the computation structure of attention, suffering from inflexibility and low efficiency. Most prior work accelerates attention by reducing the amount of computation through various pruning algorithms, which affects accuracy to a degree that depends on the sparsity. This paper proposes a hardware accelerator for multi-head attention (MHA) on field-programmable gate arrays (FPGAs) with a reconfigurable architecture, an efficient systolic array, and a hardware-friendly radix-2 softmax. We propose a novel Four-input Processing Element (FPE) that doubles the computation rate of the data-aware systolic array (SA) while keeping it efficient and load-balanced. In particular, the computation framework is carefully designed to keep the SA highly utilized. Our design is evaluated on a Xilinx Alveo U250 card; the proposed architecture achieves 51.3× and 17.3× latency improvements and 54.4× and 17.9× energy savings compared with a CPU and a GPU, respectively.
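The abstract names two computational stages the accelerator targets: the matrix products of attention, which are mapped onto the systolic array, and a softmax evaluated in base 2 so that exponentiation becomes shift-friendly in hardware. The sketch below is a minimal NumPy model of that arithmetic, assuming standard scaled dot-product attention; the function names, shapes, and the exact base-2 rewriting are illustrative assumptions, not the paper's fixed-point FPE implementation.

```python
# Minimal sketch of scaled dot-product attention with a base-2 ("radix-2")
# softmax. Names, shapes, and numeric formats are illustrative assumptions;
# the paper's hardware uses a fixed-point, systolic-array dataflow instead.
import numpy as np

def radix2_softmax(x, axis=-1):
    """Softmax computed with powers of two: 2^z / sum(2^z).

    Rewriting e^x as 2^(x * log2(e)) lets hardware use base-2
    exponentiation; mathematically the result equals the standard softmax.
    """
    z = (x - x.max(axis=axis, keepdims=True)) * np.log2(np.e)  # shift into base 2
    p = np.exp2(z)
    return p / p.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """One attention head: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # first matrix multiply (systolic array stage)
    probs = radix2_softmax(scores)  # hardware-friendly softmax stage
    return probs @ V                # second matrix multiply (systolic array stage)

# Toy usage: sequence length 4, head dimension 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8)
```

In the accelerator, the two matrix products stream through the FPE-based systolic array and the softmax stage uses base-2 arithmetic; the NumPy version above only mirrors the arithmetic, not the dataflow or quantization.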





Published In

ACM Transactions on Embedded Computing Systems  Volume 22, Issue 6
November 2023
428 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/3632298
  • Editor: Tulika Mitra

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 November 2023
Online AM: 20 July 2022
Accepted: 10 July 2022
Revised: 19 May 2022
Received: 31 December 2021
Published in TECS Volume 22, Issue 6


Author Tags

  1. Transformer
  2. attention
  3. FPGA
  4. reconfigurable systolic array
  5. softmax
  6. accelerator

Qualifiers

  • Research-article

Funding Sources

  • Key-Area Research and Development Program of Guangdong Province
  • Natural Science Foundation of Hunan Province
  • NSFC
  • Open Research Projects of Zhejiang Lab
  • Cultivation of Shenzhen Excellent Technological and Innovative Talents (Ph.D. Basic Research Started)
  • Basic Research of Shenzhen Science and Technology Plan

Article Metrics

  • Downloads (last 12 months): 1,679
  • Downloads (last 6 weeks): 202
Reflects downloads up to 10 Nov 2024


Cited By

  • (2024) ReIPE: Recycling Idle PEs in CNN Accelerator for Vulnerable Filters Soft-Error Detection. ACM Transactions on Architecture and Code Optimization 21, 3 (2024), 1–26. DOI: 10.1145/3674909. Online publication date: 28-Jun-2024
  • (2024) Enhancing Long Sequence Input Processing in FPGA-Based Transformer Accelerators through Attention Fusion. Proceedings of the Great Lakes Symposium on VLSI 2024, 599–603. DOI: 10.1145/3649476.3658810. Online publication date: 12-Jun-2024
  • (2024) BlissCam: Boosting Eye Tracking Efficiency with Learned In-Sensor Sparse Sampling. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 1262–1277. DOI: 10.1109/ISCA59077.2024.00094. Online publication date: 29-Jun-2024
  • (2024) A Case for Low Bitwidth Floating Point Arithmetic on FPGA for Transformer Based DNN Inference. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 178–185. DOI: 10.1109/IPDPSW63119.2024.00045. Online publication date: 27-May-2024
  • (2024) High-throughput systolic array-based accelerator for hybrid transformer-CNN networks. Journal of King Saud University - Computer and Information Sciences 36, 8 (2024), 102194. DOI: 10.1016/j.jksuci.2024.102194. Online publication date: Oct-2024
  • (2023) High-Frequency Systolic Array-Based Transformer Accelerator on Field Programmable Gate Arrays. Electronics 12, 4 (2023), 822. DOI: 10.3390/electronics12040822. Online publication date: 6-Feb-2023
  • (2023) MaxEVA: Maximizing the Efficiency of Matrix Multiplication on Versal AI Engine. 2023 International Conference on Field Programmable Technology (ICFPT), 96–105. DOI: 10.1109/ICFPT59805.2023.00016. Online publication date: 12-Dec-2023
  • (2023) FPGA Accelerating Multi-Source Transfer Learning with GAT for Bioactivities of Ligands Targeting Orphan G Protein-Coupled Receptors. 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL), 317–321. DOI: 10.1109/FPL60245.2023.00054. Online publication date: 4-Sep-2023
  • (2022) Accelerating Transformer Neural Networks on FPGAs for High Energy Physics Experiments. 2022 International Conference on Field-Programmable Technology (ICFPT), 1–8. DOI: 10.1109/ICFPT56656.2022.9974463. Online publication date: 5-Dec-2022
