
Load-balancing Sparse Matrix Vector Product Kernels on GPUs

Published: 29 March 2020

Abstract

Efficient processing of irregular matrices on Single Instruction, Multiple Data (SIMD)-type architectures is a persistent challenge. Resolving it requires innovations in the development of data formats, computational techniques, and implementations that strike a balance between thread divergence, which is inherent for irregular matrices, and padding, which alleviates the performance-detrimental thread divergence but introduces artificial overhead. To this end, in this article, we address the challenge of designing high-performance sparse matrix-vector product (SpMV) kernels for NVIDIA Graphics Processing Units (GPUs). We present a compressed sparse row (CSR) format suitable for unbalanced matrices. We also provide a load-balancing kernel for the coordinate (COO) matrix format and extend it to a hybrid algorithm that stores part of the matrix in the SIMD-friendly Ellpack (ELL) format. The ratio between the ELL and COO parts is determined using a theoretical analysis of the nonzeros-per-row distribution. For the more than 2,800 test matrices available in the SuiteSparse matrix collection, we compare the performance against SpMV kernels provided by NVIDIA's cuSPARSE library and against a heavily tuned sliced ELL (SELL-P) kernel that prevents unnecessary padding by treating irregular matrices as a combination of matrix blocks stored in ELL format.
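The formats named in the abstract can be illustrated with a small sketch. The NumPy code below (an illustrative reconstruction, not the article's CUDA kernels; all function names and the split parameter k are ours) shows a CSR SpMV, a COO SpMV in which work partitions naturally by nonzero rather than by row, and a hybrid split that keeps the first k nonzeros of each row in a zero-padded ELL block and spills the remainder into a COO part:

```python
import numpy as np

def csr_spmv(row_ptr, col_idx, vals, x):
    """y = A @ x with A in CSR; vals[row_ptr[i]:row_ptr[i+1]] holds row i."""
    n = len(row_ptr) - 1
    y = np.zeros(n)
    for i in range(n):                        # one thread per row on a GPU:
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]   # long rows stall their thread
    return y

def coo_spmv(rows, cols, vals, x, n):
    """y = A @ x with A in COO; work splits evenly over the nonzeros."""
    y = np.zeros(n)
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]                      # an atomicAdd in a GPU kernel
    return y

def split_ell_coo(row_entries, k):
    """Keep the first k nonzeros of each row in a zero-padded ELL block;
    overflow entries spill into a COO list of (row, col, val) triplets."""
    n = len(row_entries)
    ell_cols = np.zeros((n, k), dtype=int)
    ell_vals = np.zeros((n, k))
    coo = []
    for i, entries in enumerate(row_entries):
        for j, (c, v) in enumerate(entries):
            if j < k:
                ell_cols[i, j], ell_vals[i, j] = c, v
            else:
                coo.append((i, c, v))
    return ell_cols, ell_vals, coo

def hybrid_spmv(ell_cols, ell_vals, coo, x):
    """ELL part vectorizes cleanly (padding contributes zero); COO mops up."""
    y = (ell_vals * x[ell_cols]).sum(axis=1)
    for r, c, v in coo:
        y[r] += v * x[c]
    return y
```

For A = [[4, 0, 1], [0, 3, 0], [2, 0, 5]] and x = (1, 2, 3), all three paths return y = (7, 6, 17). In the article's approach the ELL/COO split is chosen from a theoretical analysis of the nonzeros-per-row distribution, not fixed by hand as here.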





Published In

ACM Transactions on Parallel Computing, Volume 7, Issue 1
Special Issue on Innovations in Systems for Irregular Applications, Part 1 and Regular Paper
March 2020, 182 pages
ISSN: 2329-4949
EISSN: 2329-4957
DOI: 10.1145/3387354
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 March 2020
Accepted: 01 October 2019
Revised: 01 September 2019
Received: 01 December 2018
Published in TOPC Volume 7, Issue 1


Author Tags

  1. GPUs
  2. Sparse Matrix Vector Product (SpMV)
  3. irregular matrices

Qualifiers

  • Research-article
  • Research
  • Refereed


Cited By

  • (2024) Optimization of Large-Scale Sparse Matrix-Vector Multiplication on Multi-GPU Systems. ACM Transactions on Architecture and Code Optimization. DOI: 10.1145/3676847. Online publication date: 8-Jul-2024.
  • (2024) CAMLB-SpMV: An Efficient Cache-Aware Memory Load-Balancing SpMV on CPU. Proceedings of the 53rd International Conference on Parallel Processing, 640–649. DOI: 10.1145/3673038.3673042. Online publication date: 12-Aug-2024.
  • (2024) Revisiting thread configuration of SpMV kernels on GPU. Journal of Parallel and Distributed Computing 185:C. DOI: 10.1016/j.jpdc.2023.104799. Online publication date: 4-Mar-2024.
  • (2023) Compressed basis GMRES on high-performance graphics processing units. International Journal of High Performance Computing Applications 37:2, 82–100. DOI: 10.1177/10943420221115140. Online publication date: 1-Mar-2023.
  • (2023) Connectivity-Aware Link Analysis for Skewed Graphs. Proceedings of the 52nd International Conference on Parallel Processing, 482–491. DOI: 10.1145/3605573.3605579. Online publication date: 7-Aug-2023.
  • (2023) A Heterogeneous Parallel Computing Approach Optimizing SpTTM on CPU-GPU via GCN. ACM Transactions on Parallel Computing 10:2, 1–23. DOI: 10.1145/3584373. Online publication date: 20-Jun-2023.
  • (2023) Optimization Techniques for GPU Programming. ACM Computing Surveys 55:11, 1–81. DOI: 10.1145/3570638. Online publication date: 16-Mar-2023.
  • (2023) On Higher-performance Sparse Tensor Transposition. 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 697–701. DOI: 10.1109/IPDPSW59300.2023.00118. Online publication date: May-2023.
  • (2023) HASpMV: Heterogeneity-Aware Sparse Matrix-Vector Multiplication on Modern Asymmetric Multicore Processors. 2023 IEEE International Conference on Cluster Computing (CLUSTER), 209–220. DOI: 10.1109/CLUSTER52292.2023.00025. Online publication date: 31-Oct-2023.
  • (2022) Adaptive Hybrid Storage Format for Sparse Matrix–Vector Multiplication on Multi-Core SIMD CPUs. Applied Sciences 12:19, 9812. DOI: 10.3390/app12199812. Online publication date: 29-Sep-2022.
