DOI: 10.1145/3545008.3545086

DSSA: Dual-Side Sparse Systolic Array Architecture for Accelerating Convolutional Neural Network Training

Published: 13 January 2023

Abstract

The ever-growing size of CNNs incurs a significant amount of redundancy in model parameters, which in turn places a considerable burden on hardware. Unstructured pruning is widely used to remove this redundancy, sparsifying the model. However, the irregularity introduced by unstructured pruning makes it difficult to accelerate sparse CNNs on systolic arrays. To address this issue, a variety of accelerators have been proposed. SIGMA, the state-of-the-art sparse GEMM accelerator, achieves significant speedup over a systolic array. However, SIGMA suffers from two disadvantages: 1) it supports only one-side sparsity, leaving potential for further performance gains; and 2) it improves the utilization of large-sized systolic arrays at the cost of extra overhead.
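To make the irregularity concrete, the following is a minimal C++ sketch (illustrative only, not taken from the paper) of unstructured magnitude pruning with the surviving weights stored in CSR form; the threshold criterion and the CsrMatrix layout are assumptions for illustration. Because each row may retain a different number of non-zeros at arbitrary positions, a fixed dense systolic-array dataflow cannot exploit the zeros directly.

```cpp
#include <cmath>
#include <vector>

// Illustrative only: unstructured (element-wise) magnitude pruning, with the
// survivors stored in CSR form. Per-row non-zero counts and positions vary
// arbitrarily -- the irregularity discussed in the abstract.
struct CsrMatrix {                  // hypothetical layout, not DSSA's on-chip format
    std::vector<float> values;      // non-zero values
    std::vector<int>   col_idx;     // column index of each non-zero
    std::vector<int>   row_ptr;     // offsets into values/col_idx, size rows + 1
};

CsrMatrix prune_to_csr(const std::vector<float>& dense, int rows, int cols,
                       float threshold) {
    CsrMatrix csr;
    csr.row_ptr.push_back(0);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            float w = dense[r * cols + c];
            if (std::fabs(w) > threshold) {     // keep only large-magnitude weights
                csr.values.push_back(w);
                csr.col_idx.push_back(c);
            }
        }
        csr.row_ptr.push_back(static_cast<int>(csr.values.size()));
    }
    return csr;
}
```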
In this paper, we propose DSSA, a dual-side sparse systolic array, to accelerate CNN training. DSSA bases its design on a small-sized systolic array, which naturally achieves higher cell utilization without additional overhead. To facilitate dual-side sparsity processing, DSSA employs a cross-cycle reduction module to accumulate partial sums that belong to the same column but are processed in different cycles. A comprehensive design space exploration is performed to find locally optimal configurations for DSSA. We implement the logic design of DSSA in Verilog RTL and evaluate its performance using a C++-based cycle-accurate performance simulator that we built. Experimental results show that DSSA delivers average speedups of 2.13x and 13.81x over SIGMA and over a basic systolic array with the same number of cells, respectively. Compared to SIGMA, DSSA incurs 16.59% area overhead and 25.49% power overhead when the sparse filter is excluded, as in SIGMA.
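As a rough functional analogue of dual-side sparse GEMM with cross-cycle reduction (a sketch under our own assumptions, not the paper's RTL or dataflow), the loop nest below multiplies two CSR operands. Partial products that target the same output element are generated in different iterations, standing in for different cycles, and are reduced into a persistent accumulator, which is the role the abstract ascribes to the cross-cycle reduction module in hardware.

```cpp
#include <cstddef>
#include <vector>

struct Csr {                        // hypothetical compressed format
    std::vector<float> values;
    std::vector<int>   col_idx;
    std::vector<int>   row_ptr;     // size = rows + 1
};

// C = A * B with both operands sparse (dual-side sparsity). For each
// non-zero A(i,k), every non-zero B(k,j) contributes a partial product to
// C(i,j); contributions to the same C(i,j) arrive in different iterations
// (a software stand-in for different cycles) and must therefore be reduced
// into an accumulator that persists across them.
std::vector<float> dual_side_spmm(const Csr& A, const Csr& B,
                                  int a_rows, int b_cols) {
    std::vector<float> acc(static_cast<std::size_t>(a_rows) * b_cols, 0.0f);
    for (int i = 0; i < a_rows; ++i) {
        for (int pa = A.row_ptr[i]; pa < A.row_ptr[i + 1]; ++pa) {
            const int   k = A.col_idx[pa];
            const float a = A.values[pa];
            for (int pb = B.row_ptr[k]; pb < B.row_ptr[k + 1]; ++pb) {
                // Cross-iteration reduction: acc keeps the running partial sum.
                acc[static_cast<std::size_t>(i) * b_cols + B.col_idx[pb]] += a * B.values[pb];
            }
        }
    }
    return acc;   // row-major dense result of size a_rows x b_cols
}
```

In this software model the accumulator buffer simply holds every partial sum until the multiplication finishes; the abstract's cross-cycle reduction module performs the analogous accumulation for partial sums of the same output column produced in different cycles.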

References

[1] L. J. Ba and R. Caruana. 2013. Do deep nets really need to be deep? arXiv preprint arXiv:1312.6184 (2013).
[2] N. Bell and M. Garland. 2009. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In SC. 1–11.
[3] Y.-H. Chen, T.-J. Yang, et al. 2019. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE J. Emerg. Sel. Topics Circuits Syst. 9, 2 (2019), 292–308.
[4] R. Collobert, J. Weston, et al. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12 (2011), 2493–2537.
[5] M. Denil, B. Shakibi, et al. 2013. Predicting parameters in deep learning. arXiv preprint arXiv:1306.0543 (2013).
[6] H. Genc, A. Haj-Ali, et al. 2019. Gemmini: An agile systolic array generator enabling systematic evaluations of deep-learning architectures. arXiv preprint arXiv:1911.09925 (2019).
[7] Y. Gong, L. Liu, et al. 2014. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115 (2014).
[8] S. Gupta, A. Agrawal, et al. 2015. Deep learning with limited numerical precision. In ICML. 1737–1746.
[9] S. Han, X. Liu, et al. 2016. EIE: Efficient inference engine on compressed deep neural network. In ISCA. 243–254.
[10] S. Han, H. Mao, et al. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149 (2015).
[11] B. Hassibi and D. G. Stork. 1993. Second order derivatives for network pruning: Optimal brain surgeon. Morgan Kaufmann.
[12] K. He, X. Zhang, et al. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
[13] X. He, S. Pal, et al. 2020. Sparse-TPU: Adapting systolic arrays for sparse matrices. In ICS. 1–12.
[14] A. G. Howard, M. Zhu, et al. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
[15] H. Hu, R. Peng, Y.-W. Tai, et al. 2016. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250 (2016).
[16] F. N. Iandola, S. Han, et al. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016).
[17] M. Jaderberg, A. Vedaldi, et al. 2014. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866 (2014).
[18] Y. Jia, E. Shelhamer, et al. 2014. Caffe: Convolutional architecture for fast feature embedding. In MM. 675–678.
[19] N. P. Jouppi, C. Young, et al. 2017. In-datacenter performance analysis of a tensor processing unit. In ISCA. 1–12.
[20] A. Korattikara, V. Rathod, et al. 2015. Bayesian dark knowledge. arXiv preprint arXiv:1506.04416 (2015).
[21] A. Krizhevsky, I. Sutskever, et al. 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (2012), 1097–1105.
[22] H. T. Kung, B. McDanel, et al. 2019. Packing sparse convolutional neural networks for efficient systolic array implementations: Column combining under joint optimization. In ASPLOS. 821–834.
[23] H. T. Kung. 1982. Why systolic architectures? Computer 15, 1 (1982), 37–46.
[24] Y. LeCun, L. Bottou, et al. 1998. Gradient-based learning applied to document recognition. Proc. of the IEEE 86, 11 (1998), 2278–2324.
[25] Y. LeCun, J. S. Denker, et al. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems. 598–605.
[26] W. Liu, D. Anguelov, et al. 2016. SSD: Single shot multibox detector. In European Conference on Computer Vision. 21–37.
[27] Z. Liu, P. N. Whatmough, et al. 2020. Sparse systolic tensor array for efficient CNN hardware acceleration. arXiv preprint arXiv:2009.02381 (2020).
[28] A. Mishra, J. A. Latorre, et al. 2021. Accelerating sparse deep neural networks. arXiv preprint arXiv:2104.08378 (2021).
[29] NVIDIA. 2020. NVIDIA A100 Tensor Core GPU Architecture. (2020).
[30] S. Pal, J. Beaumont, et al. 2018. OuterSPACE: An outer product based sparse matrix multiplication accelerator. In HPCA. 724–736.
[31] A. Parashar, M. Rhu, et al. 2017. SCNN: An accelerator for compressed-sparse convolutional neural networks. In ISCA. 27–40.
[32] J. Peltenburg, S. Ren, et al. 2016. Maximizing systolic array efficiency to accelerate the PairHMM forward algorithm. In BIBM. 758–762.
[33] E. Qin, A. Samajdar, et al. 2020. SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training. In HPCA. 58–70.
[34] J. Redmon and A. Farhadi. 2018. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).
[35] F. Shi, H. Li, et al. 2018. Sparse Winograd convolutional neural networks on small-scale systolic arrays. arXiv preprint arXiv:1810.01973 (2018).
[36] K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[37] N. Srivastava, H. Jin, et al. 2020. MatRaptor: A sparse-sparse matrix multiplication accelerator based on row-wise product. In MICRO. 766–780.
[38] Y. Wang, C. Zhang, et al. 2021. Dual-side sparse tensor core. arXiv preprint arXiv:2105.09564 (2021).
[39] W. Wen, C. Wu, et al. 2016. Learning structured sparsity in deep neural networks. Advances in Neural Information Processing Systems 29 (2016), 2074–2082.
[40] Z. Yang, L. Wang, et al. 2018. Systolic array based accelerator and algorithm mapping for deep learning algorithms. In NPC. 153–158.
[41] G. Zhang, N. Attaluri, et al. 2021. Gamma: Leveraging Gustavson's algorithm to accelerate sparse matrix multiplication. In ASPLOS. 687–701.
[42] Z. Zhang, H. Wang, et al. 2020. SpArch: Efficient architecture for sparse matrix multiplication. In HPCA. 261–274.
[43] M. Zhu and Y. Xie. 2020. Taming unstructured sparsity on GPUs via latency-aware optimization. In DAC. 1–6.
[44] M. Zhu, T. Zhang, et al. 2019. Sparse tensor core: Algorithm and hardware co-design for vector-wise sparse neural networks on modern GPUs. In MICRO. 359–371.



Published In

ICPP '22: Proceedings of the 51st International Conference on Parallel Processing
August 2022
976 pages
ISBN:9781450397339
DOI:10.1145/3545008

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Convolutional Neural Network
  2. Hardware Accelerator
  3. Sparsity Processing
  4. Systolic Array
  5. Unstructured Pruning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP '22: 51st International Conference on Parallel Processing
August 29 - September 1, 2022
Bordeaux, France

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

