DOI: 10.1145/3447818.3460372
Research article

Partitioning sparse deep neural networks for scalable training and inference

Published: 04 June 2021

Abstract

State-of-the-art deep neural networks (DNNs) have significant computational and data management requirements, and the sizes of both training data and models continue to increase. Sparsification and pruning methods have been shown to be effective at removing a large fraction of the connections in DNNs. The resulting sparse networks present unique challenges for further improving the computational efficiency of training and inference in deep learning. Both the feedforward (inference) and backpropagation steps of the stochastic gradient descent (SGD) algorithm for training sparse DNNs involve consecutive sparse matrix-vector multiplications (SpMVs). We first introduce a distributed-memory parallel SpMV-based solution for the SGD algorithm to improve its scalability. The parallelization approach is based on a row-wise partitioning of the weight matrices that represent the neuron connections between consecutive layers. We then propose a novel hypergraph model for partitioning the weight matrices to reduce the total communication volume and ensure computational load balance among processors. Experiments performed on sparse DNNs demonstrate that the proposed solution is highly efficient and scalable, and that the proposed matrix partitioning scheme further improves its performance significantly.
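To make the abstract's two core ideas concrete, below is a minimal sketch in Python with SciPy. It is not the authors' implementation: the layer widths, the 1% nonzero density, the ReLU activation, and the uniform block row partition are illustrative assumptions. The sketch shows (1) sparse-DNN inference as a chain of SpMVs over the layer weight matrices and (2), for a naive row-wise block partition, how many distinct input-vector entries each processor would touch, which is the kind of communication requirement the paper's partitioning scheme targets.

```python
# Minimal sketch (not the authors' code): sparse-DNN inference as consecutive
# SpMVs, plus a rough per-processor count of needed input-vector entries
# under a naive block row-wise partition of a weight matrix.
# Layer widths, density, and the ReLU activation are assumptions.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
layer_sizes = [1024, 1024, 1024, 1024]                     # assumed widths
weights = [
    sp.random(layer_sizes[i + 1], layer_sizes[i],
              density=0.01, format="csr", random_state=rng)
    for i in range(len(layer_sizes) - 1)
]

def feedforward(x, weights):
    """One inference pass: each layer reduces to a sparse matrix-vector
    multiplication followed by a nonlinearity."""
    for W in weights:
        x = np.maximum(W @ x, 0.0)                         # SpMV + ReLU
    return x

def rowwise_input_needs(W, num_parts):
    """For a block row-wise partition of W, count how many distinct entries
    of the input vector each part touches; entries owned by other parts
    would have to be communicated before the local SpMV can start."""
    rows_per_part = -(-W.shape[0] // num_parts)            # ceiling division
    needs = []
    for p in range(num_parts):
        block = W[p * rows_per_part:(p + 1) * rows_per_part, :]
        needs.append(np.unique(block.tocsr().indices).size)
    return needs

y = feedforward(rng.random(layer_sizes[0]), weights)
print(rowwise_input_needs(weights[0], num_parts=4))
```

In the paper's approach, the row blocks are not uniform as in this sketch: the hypergraph partitioning model chooses which rows each processor owns so that the total volume of such input-vector fetches is reduced while the nonzeros, and hence the SpMV work, remain balanced across processors.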


Published In

ICS '21: Proceedings of the 35th ACM International Conference on Supercomputing
June 2021
506 pages
ISBN:9781450383356
DOI:10.1145/3447818

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. distributed stochastic gradient descent
  2. hypergraph partitioning
  3. scalable deep learning
  4. sparse deep neural networks
  5. sparse matrix vector multiplication

Qualifiers

  • Research-article

Conference

ICS '21

Acceptance Rates

ICS '21 paper acceptance rate: 39 of 157 submissions (25%)
Overall acceptance rate: 629 of 2,180 submissions (29%)

Article Metrics

  • Downloads (last 12 months): 51
  • Downloads (last 6 weeks): 4
Reflects downloads up to 30 Aug 2024

Cited By

  • (2024) D3-GNN: Dynamic Distributed Dataflow for Streaming Graph Neural Networks. Proceedings of the VLDB Endowment 17(11), 2764-2777. DOI: 10.14778/3681954.3681961. Online publication date: 1-Jul-2024.
  • (2024) FSD-Inference: Fully Serverless Distributed Inference with Scalable Cloud Communication. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 2109-2122. DOI: 10.1109/ICDE60146.2024.00168. Online publication date: 13-May-2024.
  • (2023) Leveraging Memory Copy Overlap for Efficient Sparse Matrix-Vector Multiplication on GPUs. Electronics 12(17), 3687. DOI: 10.3390/electronics12173687. Online publication date: 31-Aug-2023.
  • (2023) Low-Latency Federated Learning With DNN Partition in Distributed Industrial IoT Networks. IEEE Journal on Selected Areas in Communications 41(3), 755-775. DOI: 10.1109/JSAC.2022.3229436. Online publication date: Mar-2023.
  • (2023) Dynamic layer-wise sparsification for distributed deep learning. Future Generation Computer Systems 147(C), 1-15. DOI: 10.1016/j.future.2023.04.022. Online publication date: 1-Oct-2023.
  • (2022) A Lightweight Self-Supervised Representation Learning Algorithm for Scene Classification in Spaceborne SAR and Optical Images. Remote Sensing 14(13), 2956. DOI: 10.3390/rs14132956. Online publication date: 21-Jun-2022.
  • (2022) Mapping and Optimization Method of SpMV on Multi-DSP Accelerator. Electronics 11(22), 3699. DOI: 10.3390/electronics11223699. Online publication date: 11-Nov-2022.
  • (2022) Scalable Graph Convolutional Network Training on Distributed-Memory Systems. Proceedings of the VLDB Endowment 16(4), 711-724. DOI: 10.14778/3574245.3574256. Online publication date: 1-Dec-2022.
