Highly Efficient Load-Balanced Dataflow for SpGEMMs on Systolic Arrays

Published: 12 June 2024

Abstract

    To enhance the efficiency of sparse neural network models, compression methods are commonly employed to store only the non-zero elements in a sparse storage format. Sparse General Matrix Multiplication (SpGEMM) is a critical computation in deep neural networks. However, when systolic arrays are used for SpGEMM, a challenge arises from the irregular flow of compressed, non-zero activation data: the activation streams entering the systolic array vary in length from batch to batch, which can leave processing units underutilized. Our research focuses on repackaging compressed data streams through hardware-software co-design to minimize software pre-processing time. We package the unevenly sized rows of the compressed sparse matrix into multiple groups of activation data streams with approximately equal lengths. This distributes the workload evenly across the fixed-size systolic array and improves the utilization of its Processing Elements (PEs). Our evaluation demonstrates that our method achieves a 2.01x speedup over uncompressed sparse data streams and a 3.63x average speedup relative to TPUs. Furthermore, compared to the state-of-the-art SpGEMM accelerator SADD, our approach achieves an average 2.01x speedup.
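    The core load-balancing idea, grouping compressed rows of unequal length into streams of approximately equal total length, can be illustrated with a classic longest-processing-time greedy heuristic. This is a minimal sketch of the general technique, not the authors' actual packaging algorithm; the function name and inputs are hypothetical.

    ```python
    import heapq

    def pack_rows(row_nnz, num_groups):
        """Greedily assign rows (identified by index) to the currently
        lightest group, so group totals end up approximately equal.

        row_nnz    -- number of non-zero elements in each compressed row
        num_groups -- number of activation streams to produce
        Returns a list of row-index lists, one per group.
        """
        # Visit rows longest-first (LPT heuristic) for a tighter balance.
        order = sorted(range(len(row_nnz)), key=lambda i: row_nnz[i], reverse=True)
        # Min-heap of (current total length, group id).
        heap = [(0, g) for g in range(num_groups)]
        heapq.heapify(heap)
        groups = [[] for _ in range(num_groups)]
        for i in order:
            total, g = heapq.heappop(heap)
            groups[g].append(i)
            heapq.heappush(heap, (total + row_nnz[i], g))
        return groups

    # Example: non-zero counts per compressed row of an activation matrix.
    nnz = [9, 2, 7, 3, 8, 1, 4, 6]
    groups = pack_rows(nnz, 2)
    lengths = [sum(nnz[i] for i in g) for g in groups]
    ```

    In the example, the two resulting streams carry equal total lengths, so neither column of PEs would sit idle waiting for the other. The paper's contribution additionally moves this repackaging into hardware to avoid software pre-processing overhead.
    
    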

    References

    [1]
    Eric Qin, Ananda Samajdar, Hyoukjun Kwon, Vineet Nadella, Sudarshan Srinivasan, Dipankar Das, Bharat Kaul, and Tushar Krishna. 2020. SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training. In IEEE International Symposium on High Performance Computer Architecture, HPCA 2020, San Diego, CA, USA, February 22-26, 2020. IEEE, 58–70. https://doi.org/10.1109/HPCA47549.2020.00015
    [2]
    Norman P. Jouppi, Cliff Young, Nishant Patil, and David A. Patterson. 2018. Motivation for and Evaluation of the First Tensor Processing Unit. IEEE Micro 38, 3 (2018), 10–19. https://doi.org/10.1109/MM.2018.032271057
    [3]
    Hesam Shabani, Abhishek Singh, Bishoy Youhana, and Xiaochen Guo. 2023. HIRAC: A Hierarchical Accelerator with Sorting-based Packing for SpGEMMs in DNN Applications. In IEEE International Symposium on High-Performance Computer Architecture, HPCA 2023, Montreal, QC, Canada, February 25 - March 1, 2023. IEEE, 247–258. https://doi.org/10.1109/HPCA56546.2023.10070977
    [4]
    Wenhao Sun, Deng Liu, Zhiwei Zou, Wendi Sun, Song Chen, and Yi Kang. 2023. Sense: Model-Hardware Codesign for Accelerating Sparse CNNs on Systolic Arrays. IEEE Trans. Very Large Scale Integr. Syst. 31, 4 (2023), 470–483. https://doi.org/10.1109/TVLSI.2023.3241933
    [5]
    Minjin Tang, Mei Wen, Yasong Cao, Junzhong Shen, Jianchao Yang, Jiawei Fei, Yang Guo, and Sheng Liu. 2022. Mentha: Enabling Sparse-Packing Computation on Systolic Arrays. In Proceedings of the 51st International Conference on Parallel Processing, ICPP 2022, Bordeaux, France, 29 August 2022 - 1 September 2022. ACM, 18:1–18:11. https://doi.org/10.1145/3545008.3545053
    [6]
    Bo Wang, Sheng Ma, Zhong Liu, Libo Huang, Yuan Yuan, and Yi Dai. 2022. SADD: A Novel Systolic Array Accelerator with Dynamic Dataflow for Sparse GEMM in Deep Learning. In Network and Parallel Computing - 19th IFIP WG 10.3 International Conference, NPC 2022, Jinan, China, September 24-25, 2022, Proceedings(Lecture Notes in Computer Science, Vol. 13615), Shaoshan Liu and Xiaohui Wei (Eds.). Springer, 42–53. https://doi.org/10.1007/978-3-031-21395-3_4
    [7]
    Xinheng Liu, Yao Chen, Cong Hao, Ashutosh Dhar, and Deming Chen. 2021. WinoCNN: Kernel Sharing Winograd Systolic Array for Efficient Convolutional Neural Network Acceleration on FPGAs. In 32nd IEEE International Conference on Application-specific Systems, Architectures and Processors, ASAP 2021, Virtual Conference, USA, July 7-9, 2021. IEEE, 258–265. https://doi.org/10.1109/ASAP52443.2021.00045
    [8]
    Jianhua Gao, Weixing Ji, Fangli Chang, Shiyu Han, Bingxin Wei, Zeming Liu, and Yizhuo Wang. 2023. A Systematic Survey of General Sparse Matrix-matrix Multiplication. ACM Comput. Surv. 55, 12 (2023), 244:1–244:36. https://doi.org/10.1145/3571157
    [9]
    Xin He, Subhankar Pal, Aporva Amarnath, Siying Feng, Dong-Hyeon Park, Austin Rovinski, Haojie Ye, Kuan-Yu Chen, Ronald G. Dreslinski, and Trevor N. Mudge. 2020. Sparse-TPU: adapting systolic arrays for sparse matrices. In ICS ’20: 2020 International Conference on Supercomputing, Barcelona Spain, June, 2020, Eduard Ayguadé, Wen-mei W. Hwu, Rosa M. Badia, and H. Peter Hofstee (Eds.). ACM, 19:1–19:12. https://doi.org/10.1145/3392717.3392751

    Published In
      GLSVLSI '24: Proceedings of the Great Lakes Symposium on VLSI 2024
      June 2024
      797 pages
      ISBN:9798400706059
      DOI:10.1145/3649476

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. Workload balance
      2. Dataflow packaging algorithm
      3. Software and hardware co-design
      4. SpGEMM
      5. Systolic arrays

      Qualifiers

      • Short-paper
      • Research
      • Refereed limited

      Funding Sources

      • National Key Research and Development Program of China

      Conference

      GLSVLSI '24: Great Lakes Symposium on VLSI 2024
      June 12 - 14, 2024
      Clearwater, FL, USA

      Acceptance Rates

      Overall Acceptance Rate 312 of 1,156 submissions, 27%
