GPU-Accelerated Sparse Matrix-Matrix Multiplication by Iterative Row Merging

Published: 01 January 2015

Abstract

We present an algorithm for general sparse matrix-matrix multiplication (SpGEMM) on many-core architectures such as GPUs. SpGEMM is implemented by iterative row merging, similar to merge sort, except that elements with duplicate column indices are aggregated on the fly. The main kernel merges small numbers of sparse rows at once using subwarps of threads to realize an early compression effect, which reduces the overhead of global memory accesses. The performance is compared with a parallel CPU implementation as well as with three GPU-based implementations. Measurements of the matrix square for 21 sparse matrices show that the proposed method consistently outperforms the other methods. Analysis shows that the performance is achieved by exploiting the compression effect and the GPU caching architecture. Improved performance was also found for computing Galerkin products, which are required by algebraic multigrid solvers. The performance was particularly good for seven-point stencil matrices arising in the context of diffuse optical imaging; the improved performance allows image reconstruction at higher resolution using the same computational resources.
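The row-merging idea in the abstract can be illustrated with a small sequential sketch. It is not the paper's GPU kernel: it merges rows two at a time on the CPU rather than merging several rows at once with subwarps, and the row representation (a list of `(column, value)` pairs sorted by column) and the names `merge_rows`, `merge_many`, and `spgemm` are assumptions made for illustration only.

```python
from typing import List, Tuple

# Assumed representation: one sparse row is a list of (column, value)
# pairs sorted by column index; a matrix is a list of such rows.
Row = List[Tuple[int, float]]


def merge_rows(a: Row, b: Row) -> Row:
    """Merge two sorted sparse rows, summing entries that share a column."""
    out: Row = []
    i = j = 0
    while i < len(a) and j < len(b):
        col_a, val_a = a[i]
        col_b, val_b = b[j]
        if col_a == col_b:
            # Duplicate column index: aggregate on the fly.
            out.append((col_a, val_a + val_b))
            i += 1
            j += 1
        elif col_a < col_b:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def merge_many(rows: List[Row]) -> Row:
    """Merge rows pairwise, round by round, until a single row remains."""
    if not rows:
        return []
    while len(rows) > 1:
        merged = [merge_rows(rows[k], rows[k + 1])
                  for k in range(0, len(rows) - 1, 2)]
        if len(rows) % 2:
            merged.append(rows[-1])  # odd row carries over to the next round
        rows = merged
    return rows[0]


def spgemm(A: List[Row], B: List[Row]) -> List[Row]:
    """C = A * B, where row i of C is the merge of A[i][k] * B[k] over k."""
    C: List[Row] = []
    for row in A:
        scaled = [[(col, w * val) for col, val in B[k]] for k, w in row]
        C.append(merge_many(scaled))
    return C


# Matrix square of a 3x3 tridiagonal matrix with stencil (-1, 2, -1).
A = [
    [(0, 2.0), (1, -1.0)],
    [(0, -1.0), (1, 2.0), (2, -1.0)],
    [(1, -1.0), (2, 2.0)],
]
C = spgemm(A, A)
```

Each merge round shortens the intermediate data whenever column indices coincide, which is the compression effect the abstract refers to. Under the same assumptions, a Galerkin product of the form R A P could be formed with two calls, `spgemm(spgemm(R, A), P)`.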


Published In

SIAM Journal on Scientific Computing, Volume 37, Issue 1
2015
755 pages
ISSN: 1064-8275
DOI: 10.1137/sjoce3.37.1

Publisher

Society for Industrial and Applied Mathematics

United States

Author Tags

  1. sparse matrix-matrix multiplication
  2. GPU programming
  3. algebraic multigrid
  4. fluorescence-mediated tomography

AMS Subject Classifications

  1. 65F50
  2. 65Y20
  3. 65M06

Qualifiers

  • Research-article

Cited By

  • (2024) Optimization of Sparse Matrix Computation for Algebraic Multigrid on GPUs. ACM Transactions on Architecture and Code Optimization, 21:3 (1-27). DOI: 10.1145/3664924. Online publication date: 15-May-2024
  • (2024) Compilation of Modular and General Sparse Workspaces. Proceedings of the ACM on Programming Languages, 8:PLDI (1213-1238). DOI: 10.1145/3656426. Online publication date: 20-Jun-2024
  • (2024) FSpGEMM: A Framework for Accelerating Sparse General Matrix–Matrix Multiplication Using Gustavson’s Algorithm on FPGAs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 32:4 (633-644). DOI: 10.1109/TVLSI.2024.3355499. Online publication date: 1-Apr-2024
  • (2024) Optimizing sparse general matrix–matrix multiplication for DCUs. The Journal of Supercomputing, 80:14 (20176-20200). DOI: 10.1007/s11227-024-06234-2. Online publication date: 30-May-2024
  • (2023) A New Sparse GEneral Matrix-matrix Multiplication Method for Long Vector Architecture by Hierarchical Row Merging. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (756-759). DOI: 10.1145/3624062.3625131. Online publication date: 12-Nov-2023
  • (2023) A Survey of Accelerating Parallel Sparse Linear Algebra. ACM Computing Surveys, 56:1 (1-38). DOI: 10.1145/3604606. Online publication date: 28-Aug-2023
  • (2023) A Systematic Survey of General Sparse Matrix-matrix Multiplication. ACM Computing Surveys, 55:12 (1-36). DOI: 10.1145/3571157. Online publication date: 2-Mar-2023
  • (2022) A one-for-all and o(v log(v))-cost solution for parallel merge style operations on sorted key-value arrays. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (669-682). DOI: 10.1145/3503222.3507728. Online publication date: 28-Feb-2022
  • (2022) TileSpGEMM. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (90-106). DOI: 10.1145/3503221.3508431. Online publication date: 2-Apr-2022
  • (2021) A Compressed, Divide and Conquer Algorithm for Scalable Distributed Matrix-Matrix Multiplication. The International Conference on High Performance Computing in Asia-Pacific Region (110-119). DOI: 10.1145/3432261.3432271. Online publication date: 20-Jan-2021
