Highly Efficient Load-Balanced Dataflow for SpGEMMs on Systolic Arrays

Published: 12 June 2024

Abstract

    To enhance the efficiency of sparse neural network models, compression methods are commonly employed to store only the non-zero elements in a sparse storage format. Sparse General Matrix Multiplication (SpGEMM) is a critical computation in deep neural networks. However, when systolic arrays are used for SpGEMM, a challenge arises from the irregular flow of compressed, non-zero activation data: the activation streams entering the systolic array vary in length from batch to batch, which can leave processing units underutilized. Our research focuses on repackaging compressed data streams through hardware-software co-design to minimize software pre-processing time. We package the unevenly sized rows of the compressed sparse matrix into multiple groups of activation data streams with approximately equal lengths. This distributes the workload evenly across the fixed-size systolic array and improves the utilization of its Processing Elements (PEs). Our evaluation demonstrates that our method achieves a 2.01x speedup over uncompressed sparse data streams and a 3.63x average speedup relative to TPUs. Furthermore, compared to the state-of-the-art SpGEMM accelerator SADD, our approach achieves an average 2.01x speedup.
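    The core load-balancing idea, grouping compressed rows of unequal length into streams of approximately equal total length, can be illustrated with a classic longest-processing-time greedy heuristic. This is a minimal sketch of the general technique, not the authors' actual packaging algorithm; the function name and inputs are hypothetical.

    ```python
    import heapq

    def pack_rows(row_nnz, num_groups):
        """Greedily assign rows (identified by index) to the currently
        lightest group, so group totals end up approximately equal.

        row_nnz    -- number of non-zero elements in each compressed row
        num_groups -- number of activation streams to produce
        Returns a list of row-index lists, one per group.
        """
        # Visit rows longest-first (LPT heuristic) for a tighter balance.
        order = sorted(range(len(row_nnz)), key=lambda i: row_nnz[i], reverse=True)
        # Min-heap of (current total length, group id).
        heap = [(0, g) for g in range(num_groups)]
        heapq.heapify(heap)
        groups = [[] for _ in range(num_groups)]
        for i in order:
            total, g = heapq.heappop(heap)
            groups[g].append(i)
            heapq.heappush(heap, (total + row_nnz[i], g))
        return groups

    # Example: non-zero counts per compressed row of an activation matrix.
    nnz = [9, 2, 7, 3, 8, 1, 4, 6]
    groups = pack_rows(nnz, 2)
    lengths = [sum(nnz[i] for i in g) for g in groups]
    ```

    In the example, the two resulting streams carry equal total lengths, so neither column of PEs would sit idle waiting for the other. The paper's contribution additionally moves this repackaging into hardware to avoid software pre-processing overhead.
    
    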

    References

    [1]
    Eric Qin, Ananda Samajdar, Hyoukjun Kwon, Vineet Nadella, Sudarshan Srinivasan, Dipankar Das, Bharat Kaul, and Tushar Krishna. 2020. SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training. In IEEE International Symposium on High Performance Computer Architecture, HPCA 2020, San Diego, CA, USA, February 22-26, 2020. IEEE, 58–70. https://doi.org/10.1109/HPCA47549.2020.00015
    [2]
    Norman P. Jouppi, Cliff Young, Nishant Patil, and David A. Patterson. 2018. Motivation for and Evaluation of the First Tensor Processing Unit. IEEE Micro 38, 3 (2018), 10–19. https://doi.org/10.1109/MM.2018.032271057
    [3]
    Hesam Shabani, Abhishek Singh, Bishoy Youhana, and Xiaochen Guo. 2023. HIRAC: A Hierarchical Accelerator with Sorting-based Packing for SpGEMMs in DNN Applications. In IEEE International Symposium on High-Performance Computer Architecture, HPCA 2023, Montreal, QC, Canada, February 25 - March 1, 2023. IEEE, 247–258. https://doi.org/10.1109/HPCA56546.2023.10070977
    [4]
    Wenhao Sun, Deng Liu, Zhiwei Zou, Wendi Sun, Song Chen, and Yi Kang. 2023. Sense: Model-Hardware Codesign for Accelerating Sparse CNNs on Systolic Arrays. IEEE Trans. Very Large Scale Integr. Syst. 31, 4 (2023), 470–483. https://doi.org/10.1109/TVLSI.2023.3241933
    [5]
    Minjin Tang, Mei Wen, Yasong Cao, Junzhong Shen, Jianchao Yang, Jiawei Fei, Yang Guo, and Sheng Liu. 2022. Mentha: Enabling Sparse-Packing Computation on Systolic Arrays. In Proceedings of the 51st International Conference on Parallel Processing, ICPP 2022, Bordeaux, France, 29 August 2022 - 1 September 2022. ACM, 18:1–18:11. https://doi.org/10.1145/3545008.3545053
    [6]
    Bo Wang, Sheng Ma, Zhong Liu, Libo Huang, Yuan Yuan, and Yi Dai. 2022. SADD: A Novel Systolic Array Accelerator with Dynamic Dataflow for Sparse GEMM in Deep Learning. In Network and Parallel Computing - 19th IFIP WG 10.3 International Conference, NPC 2022, Jinan, China, September 24-25, 2022, Proceedings(Lecture Notes in Computer Science, Vol. 13615), Shaoshan Liu and Xiaohui Wei (Eds.). Springer, 42–53. https://doi.org/10.1007/978-3-031-21395-3_4
    [7]
    Xinheng Liu, Yao Chen, Cong Hao, Ashutosh Dhar, and Deming Chen. 2021. WinoCNN: Kernel Sharing Winograd Systolic Array for Efficient Convolutional Neural Network Acceleration on FPGAs. In 32nd IEEE International Conference on Application-specific Systems, Architectures and Processors, ASAP 2021, Virtual Conference, USA, July 7-9, 2021. IEEE, 258–265. https://doi.org/10.1109/ASAP52443.2021.00045
    [8]
    Jianhua Gao, Weixing Ji, Fangli Chang, Shiyu Han, Bingxin Wei, Zeming Liu, and Yizhuo Wang. 2023. A Systematic Survey of General Sparse Matrix-matrix Multiplication. ACM Comput. Surv. 55, 12 (2023), 244:1–244:36. https://doi.org/10.1145/3571157
    [9]
    Xin He, Subhankar Pal, Aporva Amarnath, Siying Feng, Dong-Hyeon Park, Austin Rovinski, Haojie Ye, Kuan-Yu Chen, Ronald G. Dreslinski, and Trevor N. Mudge. 2020. Sparse-TPU: adapting systolic arrays for sparse matrices. In ICS ’20: 2020 International Conference on Supercomputing, Barcelona Spain, June, 2020, Eduard Ayguadé, Wen-mei W. Hwu, Rosa M. Badia, and H. Peter Hofstee (Eds.). ACM, 19:1–19:12. https://doi.org/10.1145/3392717.3392751

    Published In
      GLSVLSI '24: Proceedings of the Great Lakes Symposium on VLSI 2024
      June 2024
      797 pages
      ISBN:9798400706059
      DOI:10.1145/3649476

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. Workload balance
      2. Dataflow packaging algorithm
      3. Software and hardware co-design
      4. SpGEMM
      5. Systolic arrays

      Qualifiers

      • Short-paper
      • Research
      • Refereed limited

      Funding Sources

      • National Key Research and Development Program of China

      Conference

      GLSVLSI '24: Great Lakes Symposium on VLSI 2024
      June 12 - 14, 2024
      Clearwater, FL, USA

      Acceptance Rates

      Overall Acceptance Rate 312 of 1,156 submissions, 27%
