DOI: 10.1145/3588195.3593002
Efficient Algorithm Design of Optimizing SpMV on GPU

Published: 07 August 2023

    Abstract

    Sparse matrix-vector multiplication (SpMV) is a fundamental building block for many numerical computing applications. However, most existing GPU SpMV approaches suffer from one or more of the following: long preprocessing overhead, load imbalance, format conversion, and poor memory access patterns. In this paper, we propose two new SpMV algorithms, flat and line-enhance, together with their implementations, for GPU systems to overcome these shortcomings. Our algorithms work directly on the CSR sparse matrix format. To achieve high performance: 1) for load balance, the flat algorithm uses non-zero splitting, while line-enhance uses a mix of row and non-zero splitting; 2) memory access patterns are designed for the data loading, storing, and reduction steps of both algorithms; and 3) an adaptive approach selects the appropriate algorithm and parameters based on matrix characteristics.
    We evaluate our methods on the SuiteSparse Matrix Collection on AMD and NVIDIA GPU platforms. Compared with CSR-Vector, CSR-Adaptive, HOLA, cuSparse, and merge-based SpMV, our adaptive approach achieves average performance improvements of 424%, 741%, 49%, 46%, and 72%, respectively. In bandwidth tests, our approach also sustains a memory bandwidth close to the peak memory bandwidth.
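    The load-balancing idea behind the flat algorithm (assigning each worker an equal slice of the non-zeros rather than whole rows) can be sketched as follows. This is an illustrative host-side sketch, not the paper's GPU implementation: the name spmv_flat and the sequential worker loop are stand-ins for what on a GPU would be parallel threads, with atomic updates for rows that straddle slice boundaries.

    ```python
    # Sketch of non-zero splitting for CSR SpMV. Each worker receives an
    # equal chunk of non-zeros; a binary search on row_ptr maps the first
    # non-zero of the chunk back to its owning row, and the row index is
    # then advanced as the chunk is consumed.
    from bisect import bisect_right

    def spmv_flat(row_ptr, col_idx, vals, x, n_workers=4):
        n_rows = len(row_ptr) - 1
        nnz = len(vals)
        y = [0.0] * n_rows
        chunk = (nnz + n_workers - 1) // n_workers
        for w in range(n_workers):            # each iteration models one worker
            lo, hi = w * chunk, min((w + 1) * chunk, nnz)
            # locate the row containing non-zero index `lo` via binary search
            r = bisect_right(row_ptr, lo) - 1
            for k in range(lo, hi):
                while k >= row_ptr[r + 1]:    # advance to the owning row
                    r += 1
                y[r] += vals[k] * x[col_idx[k]]
        return y

    # 2x2 example: [[1, 2], [0, 3]] times [1, 1] prints [3.0, 3.0]
    print(spmv_flat([0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0], [1.0, 1.0]))
    ```

    Because chunks are sized by non-zero count, a single dense row no longer serializes one worker while others idle, which is the imbalance that row-splitting schemes such as CSR-Vector suffer from.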


    Cited By

    • CAMLB-SpMV: An Efficient Cache-Aware Memory Load-Balancing SpMV on CPU. In Proceedings of the 53rd International Conference on Parallel Processing (2024), 640--649. https://doi.org/10.1145/3673038.3673042
    • PckGNN: Optimizing Aggregation Operators with Packing Strategies in Graph Neural Networks. In 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (2024), 2--13. https://doi.org/10.1109/IPDPS57955.2024.00010
    • Implementation and optimization of SpMV algorithm based on SW26010P many-core processor and stored in BCSR format. Scientific Reports 14, 1 (2024). https://doi.org/10.1038/s41598-024-67462-3

      Published In

      HPDC '23: Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing
      August 2023
      350 pages
      ISBN:9798400701559
      DOI:10.1145/3588195
      General Chair: Ali R. Butt
      Program Chairs: Ningfang Mi, Kyle Chard
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Author Tags

      1. CSR
      2. GPU
      3. SpMV
      4. linear algebra
      5. sparse matrix

      Qualifiers

      • Research-article

      Funding Sources

      • National Key R&D Program of China

      Conference

      HPDC '23

      Acceptance Rates

      Overall Acceptance Rate 166 of 966 submissions, 17%

