DOI: 10.1145/3545008.3545086

DSSA: Dual-Side Sparse Systolic Array Architecture for Accelerating Convolutional Neural Network Training

Published: 13 January 2023

Abstract

The ever-growing size of CNNs incurs a significant amount of redundancy in model parameters, which in turn places a considerable burden on hardware. Unstructured pruning is widely used to remove this redundancy, sparsifying the model. However, the irregularity introduced by unstructured pruning makes it difficult to accelerate sparse CNNs on systolic arrays. To address this issue, a variety of accelerators have been proposed. SIGMA, the state-of-the-art sparse GEMM accelerator, achieves significant speedup over a systolic array. However, SIGMA suffers from two disadvantages: 1) it supports only one-side sparsity, leaving potential for further performance gains; and 2) it improves the utilization of large-sized systolic arrays at the cost of extra overhead.
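To make the irregularity concrete, the following is a minimal C++ sketch (illustrative only, not taken from the paper) of unstructured magnitude pruning with the surviving weights stored in CSR form; the threshold criterion and the CsrMatrix layout are assumptions for illustration. Because each row may retain a different number of non-zeros at arbitrary positions, a fixed dense systolic-array dataflow cannot exploit the zeros directly.

```cpp
#include <cmath>
#include <vector>

// Illustrative only: unstructured (element-wise) magnitude pruning, with the
// survivors stored in CSR form. Per-row non-zero counts and positions vary
// arbitrarily -- the irregularity discussed in the abstract.
struct CsrMatrix {                  // hypothetical layout, not DSSA's on-chip format
    std::vector<float> values;      // non-zero values
    std::vector<int>   col_idx;     // column index of each non-zero
    std::vector<int>   row_ptr;     // offsets into values/col_idx, size rows + 1
};

CsrMatrix prune_to_csr(const std::vector<float>& dense, int rows, int cols,
                       float threshold) {
    CsrMatrix csr;
    csr.row_ptr.push_back(0);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            float w = dense[r * cols + c];
            if (std::fabs(w) > threshold) {     // keep only large-magnitude weights
                csr.values.push_back(w);
                csr.col_idx.push_back(c);
            }
        }
        csr.row_ptr.push_back(static_cast<int>(csr.values.size()));
    }
    return csr;
}
```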
In this paper, we propose DSSA, a dual-side sparse systolic array, to accelerate CNN training. DSSA bases its design on a small-sized systolic array, which naturally achieves higher cell utilization without additional overhead. To facilitate dual-side sparsity processing, DSSA employs a cross-cycle reduction module to accumulate partial sums that belong to the same column but are processed in different cycles. A comprehensive design space exploration is performed to find locally optimal configurations for DSSA. We implement the logic design of DSSA in Verilog RTL and evaluate its performance using a C++-based cycle-accurate performance simulator that we built. Experimental results show that DSSA delivers average speedups of 2.13x and 13.81x over SIGMA and over a basic systolic array with the same number of cells, respectively. Compared to SIGMA, DSSA incurs 16.59% area overhead and 25.49% power overhead when the sparse filter is excluded, as in SIGMA.
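As a rough functional analogue of dual-side sparse GEMM with cross-cycle reduction (a sketch under our own assumptions, not the paper's RTL or dataflow), the loop nest below multiplies two CSR operands. Partial products that target the same output element are generated in different iterations, standing in for different cycles, and are reduced into a persistent accumulator, which is the role the abstract ascribes to the cross-cycle reduction module in hardware.

```cpp
#include <cstddef>
#include <vector>

struct Csr {                        // hypothetical compressed format
    std::vector<float> values;
    std::vector<int>   col_idx;
    std::vector<int>   row_ptr;     // size = rows + 1
};

// C = A * B with both operands sparse (dual-side sparsity). For each
// non-zero A(i,k), every non-zero B(k,j) contributes a partial product to
// C(i,j); contributions to the same C(i,j) arrive in different iterations
// (a software stand-in for different cycles) and must therefore be reduced
// into an accumulator that persists across them.
std::vector<float> dual_side_spmm(const Csr& A, const Csr& B,
                                  int a_rows, int b_cols) {
    std::vector<float> acc(static_cast<std::size_t>(a_rows) * b_cols, 0.0f);
    for (int i = 0; i < a_rows; ++i) {
        for (int pa = A.row_ptr[i]; pa < A.row_ptr[i + 1]; ++pa) {
            const int   k = A.col_idx[pa];
            const float a = A.values[pa];
            for (int pb = B.row_ptr[k]; pb < B.row_ptr[k + 1]; ++pb) {
                // Cross-iteration reduction: acc keeps the running partial sum.
                acc[static_cast<std::size_t>(i) * b_cols + B.col_idx[pb]] += a * B.values[pb];
            }
        }
    }
    return acc;   // row-major dense result of size a_rows x b_cols
}
```

In this software model the accumulator buffer simply holds every partial sum until the multiplication finishes; the abstract's cross-cycle reduction module performs the analogous accumulation for partial sums of the same output column produced in different cycles.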

References

[1] L. J. Ba and R. Caruana. 2013. Do deep nets really need to be deep? arXiv preprint arXiv:1312.6184 (2013).
[2] N. Bell and M. Garland. 2009. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In SC. 1–11.
[3] Y.-H. Chen, T.-J. Yang, et al. 2019. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE J. Emerg. Sel. Topics Circuits Syst. 9, 2 (2019), 292–308.
[4] R. Collobert, J. Weston, et al. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12 (2011), 2493–2537.
[5] M. Denil, B. Shakibi, et al. 2013. Predicting parameters in deep learning. arXiv preprint arXiv:1306.0543 (2013).
[6] H. Genc, A. Haj-Ali, et al. 2019. Gemmini: An agile systolic array generator enabling systematic evaluations of deep-learning architectures. arXiv preprint arXiv:1911.09925 (2019).
[7] Y. Gong, L. Liu, et al. 2014. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115 (2014).
[8] S. Gupta, A. Agrawal, et al. 2015. Deep learning with limited numerical precision. In ICML. 1737–1746.
[9] S. Han, X. Liu, et al. 2016. EIE: Efficient inference engine on compressed deep neural network. In ISCA. 243–254.
[10] S. Han, H. Mao, et al. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149 (2015).
[11] B. Hassibi and D. G. Stork. 1993. Second order derivatives for network pruning: Optimal brain surgeon. Morgan Kaufmann.
[12] K. He, X. Zhang, et al. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
[13] X. He, S. Pal, et al. 2020. Sparse-TPU: Adapting systolic arrays for sparse matrices. In ICS. 1–12.
[14] A. G. Howard, M. Zhu, et al. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
[15] H. Hu, R. Peng, Y.-W. Tai, et al. 2016. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250 (2016).
[16] F. N. Iandola, S. Han, et al. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016).
[17] M. Jaderberg, A. Vedaldi, et al. 2014. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866 (2014).
[18] Y. Jia, E. Shelhamer, et al. 2014. Caffe: Convolutional architecture for fast feature embedding. In MM. 675–678.
[19] N. P. Jouppi, C. Young, et al. 2017. In-datacenter performance analysis of a tensor processing unit. In ISCA. 1–12.
[20] A. Korattikara, V. Rathod, et al. 2015. Bayesian dark knowledge. arXiv preprint arXiv:1506.04416 (2015).
[21] A. Krizhevsky, I. Sutskever, et al. 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (2012), 1097–1105.
[22] H. T. Kung, B. McDanel, et al. 2019. Packing sparse convolutional neural networks for efficient systolic array implementations: Column combining under joint optimization. In ASPLOS. 821–834.
[23] H. T. Kung. 1982. Why systolic architectures? Computer 15, 1 (1982), 37–46.
[24] Y. LeCun, L. Bottou, et al. 1998. Gradient-based learning applied to document recognition. Proc. of the IEEE 86, 11 (1998), 2278–2324.
[25] Y. LeCun, J. S. Denker, et al. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems. 598–605.
[26] W. Liu, D. Anguelov, et al. 2016. SSD: Single shot multibox detector. In European Conference on Computer Vision. 21–37.
[27] Z. Liu, P. N. Whatmough, et al. 2020. Sparse systolic tensor array for efficient CNN hardware acceleration. arXiv preprint arXiv:2009.02381 (2020).
[28] A. Mishra, J. A. Latorre, et al. 2021. Accelerating sparse deep neural networks. arXiv preprint arXiv:2104.08378 (2021).
[29] NVIDIA. 2020. NVIDIA A100 Tensor Core GPU Architecture. (2020).
[30] S. Pal, J. Beaumont, et al. 2018. OuterSPACE: An outer product based sparse matrix multiplication accelerator. In HPCA. 724–736.
[31] A. Parashar, M. Rhu, et al. 2017. SCNN: An accelerator for compressed-sparse convolutional neural networks. In ISCA. 27–40.
[32] J. Peltenburg, S. Ren, et al. 2016. Maximizing systolic array efficiency to accelerate the PairHMM forward algorithm. In BIBM. 758–762.
[33] E. Qin, A. Samajdar, et al. 2020. SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training. In HPCA. 58–70.
[34] J. Redmon and A. Farhadi. 2018. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).
[35] F. Shi, H. Li, et al. 2018. Sparse Winograd convolutional neural networks on small-scale systolic arrays. arXiv preprint arXiv:1810.01973 (2018).
[36] K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[37] N. Srivastava, H. Jin, et al. 2020. MatRaptor: A sparse-sparse matrix multiplication accelerator based on row-wise product. In MICRO. 766–780.
[38] Y. Wang, C. Zhang, et al. 2021. Dual-side sparse tensor core. arXiv preprint arXiv:2105.09564 (2021).
[39] W. Wen, C. Wu, et al. 2016. Learning structured sparsity in deep neural networks. Advances in Neural Information Processing Systems 29 (2016), 2074–2082.
[40] Z. Yang, L. Wang, et al. 2018. Systolic array based accelerator and algorithm mapping for deep learning algorithms. In NPC. 153–158.
[41] G. Zhang, N. Attaluri, et al. 2021. Gamma: Leveraging Gustavson's algorithm to accelerate sparse matrix multiplication. In ASPLOS. 687–701.
[42] Z. Zhang, H. Wang, et al. 2020. SpArch: Efficient architecture for sparse matrix multiplication. In HPCA. 261–274.
[43] M. Zhu and Y. Xie. 2020. Taming unstructured sparsity on GPUs via latency-aware optimization. In DAC. 1–6.
[44] M. Zhu, T. Zhang, et al. 2019. Sparse tensor core: Algorithm and hardware co-design for vector-wise sparse neural networks on modern GPUs. In MICRO. 359–371.



Published In

ICPP '22: Proceedings of the 51st International Conference on Parallel Processing
August 2022
976 pages
ISBN:9781450397339
DOI:10.1145/3545008

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Convolutional Neural Network
  2. Hardware Accelerator
  3. Sparsity Processing
  4. Systolic Array
  5. Unstructured Pruning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP '22: 51st International Conference on Parallel Processing
August 29 - September 1, 2022
Bordeaux, France

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

