
GPU-Accelerated Sparse Matrix-Matrix Multiplication by Iterative Row Merging

Authors: Felix Gremse, Andreas Höfter, Lars Ole Schwen, Fabian Kiessling, Uwe Naumann

SIAM Journal on Scientific Computing, Volume 37, Issue 1

Pages C54 - C71

https://doi.org/10.1137/130948811

Published: 01 January 2015

Abstract

We present an algorithm for general sparse matrix-matrix multiplication (SpGEMM) on many-core architectures, such as GPUs. SpGEMM is implemented by iterative row merging, similar to merge sort, except that elements with duplicate column indices are aggregated on the fly. The main kernel merges small numbers of sparse rows at once using subwarps of threads to realize an early compression effect which reduces the overhead of global memory accesses. The performance is compared with a parallel CPU implementation as well as with three GPU-based implementations. Measurements performed for computing the matrix square for 21 sparse matrices show that the proposed method consistently outperforms the other methods. Analysis showed that the performance is achieved by utilizing the compression effect and the GPU caching architecture. An improved performance was also found for computing Galerkin products which are required by algebraic multigrid solvers. The performance was particularly good for seven-point stencil matrices arising in the context of diffuse optical imaging and the improved performance allows one to perform image reconstruction at higher resolution using the same computational resources.
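The row-merging idea in the abstract can be illustrated with a minimal sequential sketch. It is not the authors' GPU kernel: it runs on the CPU, merges rows strictly pairwise rather than in small batches handled by subwarps, and assumes each sparse row is stored as a list of `(column, value)` pairs sorted by column. The names `merge_two`, `merge_rows`, and `spgemm` are hypothetical and chosen only for this illustration.

```python
from typing import List, Tuple

# Assumed representation: one sparse row is a list of (column, value) pairs,
# sorted by column index, with no duplicate columns.
Row = List[Tuple[int, float]]


def merge_two(x: Row, y: Row) -> Row:
    """Merge two sorted sparse rows, summing entries that share a column."""
    out: Row = []
    i = j = 0
    while i < len(x) and j < len(y):
        cx, vx = x[i]
        cy, vy = y[j]
        if cx < cy:
            out.append((cx, vx))
            i += 1
        elif cy < cx:
            out.append((cy, vy))
            j += 1
        else:
            # Duplicate column index: aggregate on the fly instead of
            # emitting two entries. This is the "compression" step.
            out.append((cx, vx + vy))
            i += 1
            j += 1
    out.extend(x[i:])
    out.extend(y[j:])
    return out


def merge_rows(rows: List[Row]) -> Row:
    """Repeatedly merge rows in pairs until a single row remains."""
    if not rows:
        return []
    while len(rows) > 1:
        merged = [merge_two(rows[k], rows[k + 1])
                  for k in range(0, len(rows) - 1, 2)]
        if len(rows) % 2:
            merged.append(rows[-1])  # odd row carries over to the next pass
        rows = merged
    return rows[0]


def spgemm(a: List[Row], b: List[Row]) -> List[Row]:
    """Compute C = A * B, where row i of C is the merge of the rows of B
    selected and scaled by the nonzeros in row i of A."""
    result: List[Row] = []
    for a_row in a:
        scaled = [[(col, a_val * b_val) for col, b_val in b[k]]
                  for k, a_val in a_row]
        result.append(merge_rows(scaled))
    return result
```

Because duplicate columns are summed during each merge, intermediate rows never grow beyond the number of distinct columns seen so far, which is the effect the abstract describes as early compression. Entries that cancel to zero are kept as explicit zeros in this sketch.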





Published In


SIAM Journal on Scientific Computing Volume 37, Issue 1

2015

755 pages

ISSN:1064-8275

DOI:10.1137/sjoce3.37.1


© 2015, Society for Industrial and Applied Mathematics.

Publisher

Society for Industrial and Applied Mathematics

United States
