SparGD: A Sparse GEMM Accelerator with Dynamic Dataflow
Abstract
1 Introduction
2 Background and Motivation
2.1 Deep Learning Workload Characteristics
2.1.1 GEMM.
2.1.2 Sparsity in Deep Learning Workloads.
2.2 Typical GEMM Accelerators
2.2.1 TPU.
2.2.2 SIGMA.
2.3 Inefficiency of the TPU and SIGMA
3 Analysis of Dataflows in Popular Acceleration Architectures
3.1 Introduction to Dataflows
3.1.1 SA-OS.
3.1.2 SA-IS.
3.1.3 SA-WS.
3.1.4 SIGMA-WS.
3.1.5 SIGMA-IS.
3.2 Performance Analysis of Dataflows
3.2.1 Dataflow Analysis in the Systolic Array.
3.2.2 Dataflow Analysis in SIGMA.
4 The Architecture of SparGD
4.1 Microarchitecture
4.1.1 PE and PE Groups.
4.1.2 Distribution Network.
4.1.3 Reduction Networks.
4.2 Dynamic Dataflow Design
4.2.1 Dataflows in SparGD.
4.2.2 Dataflow Analysis of SparGD.
4.2.3 Dataflow Switching Module.
4.3 Example
5 Evaluation
5.1 Experimental Setup
5.1.1 Experimental Method.
5.1.2 Experimental Platform.
5.1.3 Workloads.
Set | Workloads | M Size | K Size | N Size | Sparsity (M-K Sparsity, K-N Sparsity)
---|---|---|---|---|---
Set 0 | ch4-4-b3 \(\times\) ch4-4-b2 | 24 | 96 | 72 | 96.8%, 96.8% |
Set 1 | ch5-5-b2 \(\times\) ch5-5-b1 | 600 | 200 | 25 | 98%, 92% |
Set 2 | klein-b2 \(\times\) klein-b1 | 20 | 30 | 10 | 90%, 80% |
Set 3 | n2c6-b2 \(\times\) n2c6-b1 | 455 | 105 | 15 | 97.1%, 86.6% |
Set 4 | n3c5-b2 \(\times\) n3c5-b1 | 120 | 45 | 10 | 93.3%, 98% |
Set 5 | n3c5-b5 \(\times\) n3c5-b4 | 210 | 252 | 210 | 97.6%, 97.6% |
Set 6 | n3c5-b7 \(\times\) n3c5-b6 | 30 | 120 | 210 | 97.6%, 97.6% |
Set 7 | n3c6-b2 \(\times\) n3c6-b1 | 455 | 105 | 105 | 97.1%, 98.1% |
Set 8 | n4c5-b11 \(\times\) n4c5-b10 | 10 | 120 | 630 | 90%, 99.1% |
Set 9 | Transformer | 256 | 512 | 64 | 80%, 90% |
Set 10 | BERT | 128 | 768 | 64 | 70%, 90% |
Set 11 | DeiT-B | 256 | 768 | 64 | 90%, 80% |
Set 12 | DeiT-S | 1,024 | 384 | 64 | 90%, 70% |
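As a worked illustration (not from the paper), the sparsity percentages above map directly to nonzero counts once the operand dimensions are fixed. A minimal Python sketch, assuming sparsity denotes the fraction of zero entries in each operand:

```python
# Illustrative sketch (assumed interpretation, not from the paper):
# convert a workload's dimensions and sparsity into nonzero counts,
# taking "sparsity" to mean the fraction of zero entries.

workloads = {
    # name: (M, K, N, M-K sparsity, K-N sparsity) -- values from the table
    "Set 9 (Transformer)": (256, 512, 64, 0.80, 0.90),
    "Set 10 (BERT)": (128, 768, 64, 0.70, 0.90),
}

for name, (m, k, n, s_mk, s_kn) in workloads.items():
    nnz_a = round(m * k * (1 - s_mk))  # nonzeros in the M-K operand
    nnz_b = round(k * n * (1 - s_kn))  # nonzeros in the K-N operand
    print(f"{name}: A keeps {nnz_a}/{m * k} entries, B keeps {nnz_b}/{k * n}")
```

For Set 9 this yields 26,214 of 131,072 entries in the M-K operand, which is why a sparsity-aware design can skip the vast majority of multiply-accumulates.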
5.2 Data Compression Analysis
5.3 Different Types of GEMM
5.3.1 Dense Regular and Dense Irregular GEMM.
5.3.2 Sparse Regular GEMM.
5.3.3 Sparse Irregular GEMM.
5.4 All Optimal Dataflow
5.5 Comparison with ExTensor
5.6 Hardware Cost Analysis
Design | SA | SIGMA | SparGD |
---|---|---|---|
Technology | Commercial 28nm | Commercial 28nm | Commercial 28nm |
Number of PEs | 64 | 64 | 64 |
Power (mW) | 69.5 | 127.3 | 155.5
Area (\(\mu m^2\)) | Total: 35,347.0 | Total: 54,901.5 | Total: 64,279.1
Area Breakdown | Local Buffer: 45%, Controller: 1.5%, PEs: 48.5%, Accumulator: 5% | Local Buffer: 43%, Controller: 7%, Benes: 12%, PEs: 30.5%, FAN: 5%, Accumulator: 2.5% | Local Buffer: 36.5%, Controller: 10.5%, PE Bus: 5.5%, PEs: 32%, PAT: 13.5%, Accumulator: 2%
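The percentage breakdowns can be converted back into absolute component areas; a minimal sketch, assuming each percentage applies to that design's total area (component names and totals taken from the table):

```python
# Illustrative sketch: recover absolute component areas (in um^2) for
# SparGD from its total area and the percentage breakdown in the table.

total_um2 = 64_279.1  # SparGD total area
breakdown = {  # fraction of total area per component
    "Local Buffer": 0.365,
    "Controller": 0.105,
    "PE Bus": 0.055,
    "PEs": 0.32,
    "PAT": 0.135,
    "Accumulator": 0.02,
}

assert abs(sum(breakdown.values()) - 1.0) < 1e-9  # shares cover the whole design

for component, share in breakdown.items():
    print(f"{component}: {total_um2 * share:,.1f} um^2")
```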
5.7 Scalability Analysis
6 Related Work
6.1 Sparsity
6.1.1 Single-sided Sparsity.
6.1.2 Double-sided Sparsity.
6.1.3 For Sparse GEMM.
6.2 Flexible Interconnect
7 Conclusion
References