DOI: 10.5555/3433701.3433722
Research article

Accelerating sparse DNN models without hardware-support via tile-wise sparsity

Published: 09 November 2020

Abstract

Network pruning can reduce the high computation cost of deep neural network (DNN) models. However, to maintain accuracy, sparse models often carry randomly distributed weights, leading to irregular computations. Consequently, sparse models cannot achieve meaningful speedups on commodity hardware (e.g., GPUs) built for dense matrix computations. As such, prior works usually modify or design completely new sparsity-optimized architectures to exploit sparsity. We propose an algorithm-software co-designed pruning method that achieves latency speedups on existing dense architectures. Our work builds on the insight that matrix multiplication generally breaks the large matrix into multiple smaller tiles for parallel execution. We propose a tiling-friendly "tile-wise" sparsity pattern, which maintains a regular pattern at the tile level for efficient execution but allows irregular, arbitrary pruning at the global scale to preserve high accuracy. We implement and evaluate the sparsity pattern on GPU tensor cores, achieving a 1.95× speedup over the dense model.
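To make the tile-level idea concrete, below is a minimal NumPy sketch of one plausible tile-wise pruning scheme: the weight matrix is split into column tiles, and within each tile whole columns are kept or dropped by magnitude, so every tile stays regular while the pattern across tiles remains irregular. The function name, tile size, keep ratio, and magnitude criterion are illustrative assumptions, not the authors' exact algorithm.

import numpy as np

def tile_wise_prune(W, tile_size=128, keep_ratio=0.5):
    # Split the weight matrix into column tiles and, within each tile,
    # keep only the highest-magnitude columns. Each tile ends up with a
    # regular (column-pruned) pattern, while different tiles may keep
    # different columns, so the global pattern stays irregular.
    mask = np.zeros_like(W, dtype=bool)
    for start in range(0, W.shape[1], tile_size):
        tile = W[:, start:start + tile_size]
        col_scores = np.abs(tile).sum(axis=0)            # per-column L1 magnitude
        n_keep = max(1, int(keep_ratio * tile.shape[1]))
        keep = np.argsort(col_scores)[-n_keep:]          # columns retained in this tile
        mask[:, start + keep] = True
    return W * mask, mask

# Usage: prune a 512x512 weight matrix, keeping half of each tile's columns.
W = np.random.randn(512, 512).astype(np.float32)
W_pruned, mask = tile_wise_prune(W, tile_size=128, keep_ratio=0.5)

Because every tile is dense over its kept columns, each tile can still be dispatched to a regular dense GEMM kernel (e.g., on tensor cores), which is what lets this pattern run fast without sparsity-specific hardware.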




Published In

SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2020, 1454 pages
ISBN: 9781728199986
Publisher: IEEE Press
In-Cooperation: IEEE CS

Conference

SC '20
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%


    Cited By

• (2025) BAFT: bubble-aware fault-tolerant framework for distributed DNN training with hybrid parallelism. Frontiers of Computer Science, 19(1). https://doi.org/10.1007/s11704-023-3401-5
• (2024) Jigsaw: Accelerating SpMM with Vector Sparsity on Sparse Tensor Core. Proceedings of the 53rd International Conference on Parallel Processing, pp. 1124-1134. https://doi.org/10.1145/3673038.3673108
• (2024) Tetris: Accelerating Sparse Convolution by Exploiting Memory Reuse on GPU. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp. 229-242. https://doi.org/10.1145/3627535.3638471
• (2024) Fractal: Joint Multi-Level Sparse Pattern Tuning of Accuracy and Performance for DNN Pruning. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pp. 416-430. https://doi.org/10.1145/3620666.3651351
• (2024) GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 450-466. https://doi.org/10.1145/3620665.3640423
• (2024) Amanda: Unified Instrumentation Framework for Deep Neural Networks. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 1-18. https://doi.org/10.1145/3617232.3624864
• (2023) Gradient-free structured pruning with unlabeled data. Proceedings of the 40th International Conference on Machine Learning, pp. 26326-26341. https://doi.org/10.5555/3618408.3619505
• (2023) Register Tiling for Unstructured Sparsity in Neural Network Inference. Proceedings of the ACM on Programming Languages, 7(PLDI), pp. 1995-2020. https://doi.org/10.1145/3591302
• (2023) Energy-Latency Attacks to On-Device Neural Networks via Sponge Poisoning. Proceedings of the 2023 Secure and Trustworthy Deep Learning Systems Workshop, pp. 1-11. https://doi.org/10.1145/3591197.3591307
• (2023) DistSim. Proceedings of the 20th ACM International Conference on Computing Frontiers, pp. 112-122. https://doi.org/10.1145/3587135.3592200
