Research Article | Open Access
DOI: 10.1145/3673038.3673054

SaSpGEMM: Sorting-Avoiding Sparse General Matrix-Matrix Multiplication on Multi-Core Processors

Published: 12 August 2024

Abstract

We propose SaSpGEMM, a parallel sparse general matrix-matrix multiplication (SpGEMM) method that avoids the overhead of sorting. The typical SpGEMM workflow consists of size prediction, memory allocation, numeric calculation, and sorting. Sorting, however, has long been an overlooked performance bottleneck: on average it accounts for 30% of the calculation stage in HASH and 55% in ESC. The key idea behind SaSpGEMM is to exploit a property of the compressed sparse row (CSR) storage format: within each row, elements are stored in increasing column-index order (established once during preprocessing), so intermediate products can be kept consistently ordered throughout the computation. To achieve this, we introduce a linked list-based accumulator (LLA) designed for batch insertion that maintains order at low time complexity. We provide a time-complexity analysis and comprehensive empirical evidence showing that SaSpGEMM outperforms other methods. Compared to three state-of-the-art methods, ESC, SPA, and HASH, on both x86 (Intel Xeon Gold 6348) and ARM (Phytium2000+) architectures, our method achieves average speedups of 2.82x, 5.24x, and 1.16x respectively (with maximum speedups of 34.4x, 195x, and 23.3x) on the Intel Xeon Gold 6348, and average speedups of 2.21x, 4.65x, and 1.05x (with maximum speedups of 40.93x, 146.7x, and 9.03x) on the Phytium2000+.
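The abstract describes the LLA only at a high level. As a rough illustration of the underlying idea, merging already-sorted CSR rows of B into a sorted accumulator with a single forward cursor so the output row never needs a final sort, here is a minimal C++ sketch. The names (Node, accumulate_row, the index-based node pool) are our own assumptions for illustration, not the paper's implementation.

```cpp
// Hypothetical sketch of a sorted linked-list accumulator for one
// Gustavson-style SpGEMM row product, C[i,:] = sum_k A[i,k] * B[k,:].
// Not the paper's code; names and structure are illustrative assumptions.
#include <cstdio>
#include <vector>

struct Node { int col; double val; int next; };  // index-based linked list

// Merge one sorted CSR row of B (scaled by aik) into the sorted list.
// Because CSR rows are sorted by column index, a single forward cursor
// suffices: each insertion resumes where the previous one left off, so
// the result row stays sorted and no post-hoc sort is needed.
void accumulate_row(std::vector<Node>& pool, int& head,
                    const int* bcols, const double* bvals, int bnnz,
                    double aik) {
    int prev = -1, cur = head;
    for (int t = 0; t < bnnz; ++t) {
        int col = bcols[t];
        while (cur != -1 && pool[cur].col < col) { prev = cur; cur = pool[cur].next; }
        if (cur != -1 && pool[cur].col == col) {
            pool[cur].val += aik * bvals[t];       // hit: accumulate product
        } else {                                    // miss: insert before cur
            pool.push_back({col, aik * bvals[t], cur});
            int node = (int)pool.size() - 1;
            if (prev == -1) head = node; else pool[prev].next = node;
            prev = node;
        }
    }
}

int main() {
    // Toy example: C[0,:] = 2*B[0,:] + 3*B[1,:] with
    // B[0,:] = {(1,1.0),(4,2.0)} and B[1,:] = {(0,5.0),(4,1.0)}.
    std::vector<Node> pool; int head = -1;
    int    b0c[] = {1, 4};  double b0v[] = {1.0, 2.0};
    int    b1c[] = {0, 4};  double b1v[] = {5.0, 1.0};
    accumulate_row(pool, head, b0c, b0v, 2, 2.0);
    accumulate_row(pool, head, b1c, b1v, 2, 3.0);
    for (int n = head; n != -1; n = pool[n].next)   // already in column order
        std::printf("(%d, %g) ", pool[n].col, pool[n].val);
    std::printf("\n");  // expected: (0, 15) (1, 2) (4, 7)
}
```

In this sketch the cursor only moves forward within a batch, so merging one row costs time proportional to the current list length plus the row's nonzero count, and the finished row is emitted in order with no separate sort stage. Index links into a growable pool (rather than raw pointers) keep insertions cheap and remain valid if the vector reallocates.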


Published In

ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing
August 2024, 1279 pages
ISBN: 9798400717932
DOI: 10.1145/3673038
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. SpGEMM
  2. linked list-based accumulator
  3. multi-core processors
  4. sorting-avoiding

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • National Key Research and Development Program of China

Conference

ICPP '24

Acceptance Rates

Overall acceptance rate: 91 of 313 submissions (29%)

