DOI: 10.1145/3588195.3593002
Efficient Algorithm Design of Optimizing SpMV on GPU

Published: 07 August 2023

    Abstract

    Sparse matrix-vector multiplication (SpMV) is a fundamental building block for many numerical computing applications. However, most existing GPU SpMV approaches suffer from one or more of the following: long preprocessing overhead, load imbalance, format conversion, and poor memory access patterns. In this paper, we propose two new SpMV algorithms, flat and line-enhance, together with their implementations, for GPU systems to overcome these shortcomings. Our algorithms work directly on the CSR sparse matrix format. To achieve high performance: 1) for load balance, the flat algorithm uses non-zero splitting, while line-enhance uses a mix of row and non-zero splitting; 2) memory access patterns are designed for the data loading, storing, and reduction steps of both algorithms; and 3) an adaptive approach selects the appropriate algorithm and parameters based on matrix characteristics.
    We evaluate our methods on the SuiteSparse Matrix Collection on AMD and NVIDIA GPU platforms. Compared with CSR-Vector, CSR-Adaptive, HOLA, cuSparse, and merge-based SpMV, our adaptive approach achieves average performance improvements of 424%, 741%, 49%, 46%, and 72%, respectively. In bandwidth tests, our approach also sustains a memory bandwidth close to the peak memory bandwidth.
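    The load-balancing idea behind the flat algorithm (assigning each worker an equal slice of the non-zeros rather than whole rows) can be sketched as follows. This is an illustrative host-side sketch, not the paper's GPU implementation: the name spmv_flat and the sequential worker loop are stand-ins for what on a GPU would be parallel threads, with atomic updates for rows that straddle slice boundaries.

    ```python
    # Sketch of non-zero splitting for CSR SpMV. Each worker receives an
    # equal chunk of non-zeros; a binary search on row_ptr maps the first
    # non-zero of the chunk back to its owning row, and the row index is
    # then advanced as the chunk is consumed.
    from bisect import bisect_right

    def spmv_flat(row_ptr, col_idx, vals, x, n_workers=4):
        n_rows = len(row_ptr) - 1
        nnz = len(vals)
        y = [0.0] * n_rows
        chunk = (nnz + n_workers - 1) // n_workers
        for w in range(n_workers):            # each iteration models one worker
            lo, hi = w * chunk, min((w + 1) * chunk, nnz)
            # locate the row containing non-zero index `lo` via binary search
            r = bisect_right(row_ptr, lo) - 1
            for k in range(lo, hi):
                while k >= row_ptr[r + 1]:    # advance to the owning row
                    r += 1
                y[r] += vals[k] * x[col_idx[k]]
        return y

    # 2x2 example: [[1, 2], [0, 3]] times [1, 1] prints [3.0, 3.0]
    print(spmv_flat([0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0], [1.0, 1.0]))
    ```

    Because chunks are sized by non-zero count, a single dense row no longer serializes one worker while others idle, which is the imbalance that row-splitting schemes such as CSR-Vector suffer from.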


    Cited By

    • CAMLB-SpMV: An Efficient Cache-Aware Memory Load-Balancing SpMV on CPU. In Proceedings of the 53rd International Conference on Parallel Processing (2024), 640--649. https://doi.org/10.1145/3673038.3673042
    • PckGNN: Optimizing Aggregation Operators with Packing Strategies in Graph Neural Networks. In 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (2024), 2--13. https://doi.org/10.1109/IPDPS57955.2024.00010
    • Implementation and optimization of SpMV algorithm based on SW26010P many-core processor and stored in BCSR format. Scientific Reports 14, 1 (2024). https://doi.org/10.1038/s41598-024-67462-3

      Published In

      HPDC '23: Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing
      August 2023
      350 pages
      ISBN:9798400701559
      DOI:10.1145/3588195
      General Chair: Ali R. Butt
      Program Chairs: Ningfang Mi, Kyle Chard
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Author Tags

      1. CSR
      2. GPU
      3. SpMV
      4. linear algebra
      5. sparse matrix

      Qualifiers

      • Research-article

      Funding Sources

      • National Key R&D Program of China

      Conference

      HPDC '23

      Acceptance Rates

      Overall Acceptance Rate 166 of 966 submissions, 17%

