SaSpGEMM: Sorting-Avoiding Sparse General Matrix-Matrix Multiplication on Multi-Core Processors

Authors: Chuhe Hong, Qinglin Wang, Runzhang Mao, Yuechao Liang, Rui Xia, Jie Liu

ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing

Pages 1166 - 1175

https://doi.org/10.1145/3673038.3673054

Published: 12 August 2024

Abstract

We propose SaSpGEMM, a parallel sparse general matrix-matrix multiplication (SpGEMM) method that avoids the overhead of sorting. The typical SpGEMM workflow consists of four steps: size prediction, memory allocation, numeric calculation, and sorting. Sorting, however, has long been overlooked as a performance bottleneck in SpGEMM: on average it accounts for 30% of the calculation stage in HASH and 55% in ESC. The key idea behind SaSpGEMM is to exploit a property of the compressed sparse row (CSR) storage format: the elements within each row are stored in increasing index order, having been sorted during preprocessing, so intermediate products can be kept in a consistent order. To achieve this, we introduce a linked list-based accumulator (LLA) designed for batch insertion that maintains order with low time complexity. We provide comprehensive empirical evidence, together with a time complexity analysis, showing that SaSpGEMM outperforms other methods. We compare against three state-of-the-art methods, ESC, SPA, and HASH, on both x86 (Intel Xeon Gold 6348) and ARM (Phytium2000+) architectures. On the Intel Xeon Gold 6348, our method achieves average speedups of 2.82x, 5.24x, and 1.16x (maximum speedups of 34.4x, 195x, and 23.3x) over ESC, SPA, and HASH, respectively. On the Phytium2000+, it achieves average speedups of 2.21x, 4.65x, and 1.05x (maximum speedups of 40.93x, 146.7x, and 9.03x).
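The idea of keeping intermediate products ordered can be illustrated with a small sequential sketch. This is not the paper's LLA implementation; it is a minimal row-wise SpGEMM that assumes every row of B already has ascending column indices, and the function name `spgemm_sorted_rows` and the array-backed linked list are hypothetical choices made only for illustration. Because each scaled row of B arrives in ascending order, the list cursor only moves forward within one batch, and the output row is read off in order with no sorting step.

```python
def spgemm_sorted_rows(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val):
    """Compute C = A @ B for CSR inputs (hypothetical helper, illustration only).

    Assumption: each row of B stores its column indices in ascending order.
    Returns C in CSR form with ascending column indices per row.
    """
    c_ptr, c_idx, c_val = [0], [], []
    for i in range(len(a_ptr) - 1):
        # Linked list kept in parallel arrays; node 0 is a head sentinel.
        cols, vals, nxt = [-1], [0.0], [-1]
        for p in range(a_ptr[i], a_ptr[i + 1]):
            k, a = a_idx[p], a_val[p]
            prev = 0  # the cursor restarts at the head once per batch
            for q in range(b_ptr[k], b_ptr[k + 1]):
                j, v = b_idx[q], a * b_val[q]
                # Advance while the next node's column is still smaller than j.
                while nxt[prev] != -1 and cols[nxt[prev]] < j:
                    prev = nxt[prev]
                cur = nxt[prev]
                if cur != -1 and cols[cur] == j:
                    vals[cur] += v  # column already present: accumulate
                else:
                    # Splice a new node between prev and cur.
                    cols.append(j)
                    vals.append(v)
                    nxt.append(cur)
                    nxt[prev] = len(cols) - 1
        # Walk the list once; entries come out in ascending column order.
        node = nxt[0]
        while node != -1:
            c_idx.append(cols[node])
            c_val.append(vals[node])
            node = nxt[node]
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val


# A = [[1, 0, 2], [0, 3, 0]], B = [[0, 4, 0], [5, 0, 0], [6, 0, 7]]
c_ptr, c_idx, c_val = spgemm_sorted_rows(
    [0, 2, 3], [0, 2, 1], [1, 2, 3],
    [0, 1, 2, 4], [1, 0, 0, 2], [4, 5, 6, 7],
)
```

In this sketch one batch costs at most the current list length plus the length of the inserted row, since the cursor never moves backward. A parallel version would need per-thread accumulators, which are omitted here.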


Index Terms

SaSpGEMM: Sorting-Avoiding Sparse General Matrix-Matrix Multiplication on Multi-Core Processors
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms
      1. Shared memory algorithms
      2. Vector / streaming algorithms
2. Mathematics of computing
  1. Mathematical software
    1. Mathematical software performance
    2. Solvers


Published In

ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing

August 2024

1279 pages

ISBN:9798400717932

DOI:10.1145/3673038

Copyright © 2024 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

the National Key Research and Development Program of China

Conference

ICPP '24: the 53rd International Conference on Parallel Processing

August 12 - 15, 2024

Gotland, Sweden

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
