research-article

spECK: accelerating GPU sparse matrix-matrix multiplication through lightweight analysis

Authors: Mathias Parger, Markus Steinberger

PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Pages 362 - 375

https://doi.org/10.1145/3332466.3374521

Published: 19 February 2020

Abstract

Sparse general matrix-matrix multiplication on GPUs is challenging due to the varying sparsity patterns of sparse matrices. Existing solutions achieve good performance for certain types of matrices but fail to accelerate all kinds of matrices equally well. Our approach combines multiple strategies with dynamic parameter selection to choose and tune the best-fitting algorithm for each row of the matrix. This choice is supported by a lightweight, multi-level matrix analysis that carefully balances analysis cost against expected performance gains. Our evaluation on thousands of matrices with various characteristics shows that we outperform all currently available solutions for 79% of all matrices with more than 15k products and achieve the second-best performance for 15%. For these matrices, our solution is on average 83% faster than the second-best approach and up to 25× faster than other state-of-the-art GPU implementations. Using our approach, applications can expect great performance independent of the matrices they work on.
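
To make the idea of a per-row choice concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes both matrices are stored in CSR form, counts the intermediate products that each row of C = A * B generates (one plausible reading of the "products" mentioned above), and maps that count to a strategy label. The function names, the thresholds, and the labels "skip", "direct", "hash", and "dense" are hypothetical placeholders chosen for illustration.

```python
# Illustrative sketch only; names, thresholds, and labels are assumptions,
# not values or APIs taken from the paper.
# A CSR matrix is described here by (row_ptr, col_idx); the numeric values
# are not needed to count intermediate products.

def products_per_row(a_row_ptr, a_col_idx, b_row_ptr):
    """Count the intermediate products generated by each row of C = A * B."""
    counts = []
    for i in range(len(a_row_ptr) - 1):
        total = 0
        for k in range(a_row_ptr[i], a_row_ptr[i + 1]):
            j = a_col_idx[k]  # a nonzero in column j of A selects row j of B
            total += b_row_ptr[j + 1] - b_row_ptr[j]  # nonzeros in row j of B
        counts.append(total)
    return counts


def choose_strategy(products, small=32, large=4096):
    """Map a row's product count to a hypothetical strategy label."""
    if products == 0:
        return "skip"    # empty output row, nothing to compute
    if products <= small:
        return "direct"  # few products: a simple accumulator may suffice
    if products <= large:
        return "hash"    # medium rows: hash-based accumulation
    return "dense"       # very long rows: dense accumulator


# A has nonzeros in columns {0, 1} of row 0, none in row 1, and {2} in row 2.
a_row_ptr = [0, 2, 2, 3]
a_col_idx = [0, 1, 2]
# B has 2, 1, and 3 nonzeros in rows 0, 1, and 2.
b_row_ptr = [0, 2, 3, 6]

counts = products_per_row(a_row_ptr, a_col_idx, b_row_ptr)
strategies = [choose_strategy(c) for c in counts]
print(counts, strategies)
```

The counting pass reads only the row pointers of B and the column indices of A, which is why an analysis of this kind can stay cheap relative to the multiplication itself.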

Index Terms

spECK: accelerating GPU sparse matrix-matrix multiplication through lightweight analysis
1. Computing methodologies
  1. Symbolic and algebraic manipulation
    1. Symbolic and algebraic algorithms
      1. Linear algebra algorithms
2. Theory of computation
  1. Design and analysis of algorithms
    1. Parallel algorithms
      1. Massively parallel algorithms

Published In

PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 2020

454 pages

ISBN:9781450368186

DOI:10.1145/3332466

General Chair: Rajiv Gupta (UC Riverside)

Program Chair: Xipeng Shen (NCSU)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

Conference

PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 22 - 26, 2020

San Diego, California

Acceptance Rates

PPoPP '20 Paper Acceptance Rate 28 of 121 submissions, 23%

Overall Acceptance Rate 230 of 1,014 submissions, 23%
