DOI: 10.1145/3503221.3508431

TileSpGEMM: a tiled algorithm for parallel sparse general matrix-matrix multiplication on GPUs

Published: 28 March 2022

Abstract

Sparse general matrix-matrix multiplication (SpGEMM) is one of the most fundamental building blocks in sparse linear solvers, graph processing frameworks and machine learning applications. Existing parallel approaches for shared-memory SpGEMM mostly use the row-row style, which can offer good parallelism. However, because of the irregularity of sparsity structures, row-row methods often suffer from three problems: (1) load imbalance, (2) high global space complexity and unsatisfactory data locality, and (3) the difficulty of selecting a sparse accumulator.
In this paper we propose a tiled parallel SpGEMM algorithm named TileSpGEMM. Our algorithm sparsifies the tiled method used in dense general matrix-matrix multiplication (GEMM) and stores each non-empty tile in a sparse form. Its first advantage is that the basic working unit is now a fixed-size sparse tile containing a small number of nonzeros, rather than a possibly very long row, so the load imbalance issue is naturally alleviated. Second, the temporary space needed for each tile is small and always fits in on-chip scratchpad memory, so there is no need to allocate off-chip space for a large number of intermediate products, and data locality is much better. Third, because the computations are restricted to a single tile, it is relatively easy to select a fast sparse accumulator for a sparse tile. Our experimental results on two of the latest NVIDIA GPUs show that TileSpGEMM outperforms four state-of-the-art SpGEMM methods, cuSPARSE, bhSPARSE, NSPARSE and spECK, on 139, 138, 127 and 94, respectively, of the 142 square matrices whose SpGEMM operation requires no fewer than one billion flops, and delivers up to 2.78x, 145.35x, 97.86x and 3.70x speedups.
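
To make the tiled scheme concrete, below is a minimal sketch of the idea in Python/SciPy. It is not the authors' CUDA implementation: the tile size, the helper names (to_tiles, tile_spgemm) and the use of small dense per-tile accumulators are all illustrative assumptions; TileSpGEMM stores each non-empty tile in a compressed sparse form and accumulates in on-chip scratchpad memory on the GPU.

# Minimal sketch of tile-based SpGEMM, assuming a fixed tile size and dense
# per-tile accumulators. All names here (TILE, to_tiles, tile_spgemm) are
# hypothetical; this is NOT the paper's CUDA implementation.
from collections import defaultdict

import numpy as np
from scipy.sparse import random as sparse_random

TILE = 4  # fixed tile size; the working unit is a tile, not a matrix row


def to_tiles(m, tile=TILE):
    """Split a sparse matrix into {(tile_row, tile_col): block}, keeping only
    non-empty tiles. Each block is a small dense array here for simplicity;
    TileSpGEMM instead keeps every non-empty tile in a sparse form."""
    dense = m.toarray()
    tiles = {}
    for i in range(0, dense.shape[0], tile):
        for j in range(0, dense.shape[1], tile):
            blk = dense[i:i + tile, j:j + tile]
            if np.any(blk):
                tiles[(i // tile, j // tile)] = blk
    return tiles


def tile_spgemm(a_tiles, b_tiles):
    """C(i,j) = sum over k of A(i,k) @ B(k,j), with one small accumulator per
    output tile. On a GPU, each accumulator fits in on-chip scratchpad memory,
    so no global buffer for intermediate products is needed."""
    b_by_row = defaultdict(list)  # index B's tiles by their tile-row k
    for (k, j), blk in b_tiles.items():
        b_by_row[k].append((j, blk))

    c_tiles = {}
    for (i, k), a_blk in a_tiles.items():
        for j, b_blk in b_by_row[k]:
            acc = c_tiles.setdefault((i, j), np.zeros((TILE, TILE)))
            acc += a_blk @ b_blk  # per-tile multiply-accumulate
    return c_tiles


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = sparse_random(8, 8, density=0.3, format="csr", random_state=rng)
    B = sparse_random(8, 8, density=0.3, format="csr", random_state=rng)

    c_tiles = tile_spgemm(to_tiles(A), to_tiles(B))

    # Reassemble the output tiles and check against SciPy's own SpGEMM.
    C = np.zeros((8, 8))
    for (i, j), blk in c_tiles.items():
        C[i * TILE:(i + 1) * TILE, j * TILE:(j + 1) * TILE] = blk
    assert np.allclose(C, (A @ B).toarray())

Because every output tile is produced and accumulated independently, work is balanced at tile granularity regardless of how nonzeros are distributed across rows, which is the load-balancing argument made in the abstract.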


Published In

PPoPP '22: Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
April 2022
495 pages
ISBN:9781450392044
DOI:10.1145/3503221
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 March 2022

Author Tags

  1. GPU
  2. SpGEMM
  3. sparse matrix
  4. tiled algorithm

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China

Conference

PPoPP '22

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Bibliometrics & Citations

Article Metrics

  • Downloads (Last 12 months)547
  • Downloads (Last 6 weeks)38
Reflects downloads up to 02 Feb 2025

Cited By

  • (2024) Predicting optimal sparse general matrix-matrix multiplication algorithm on GPUs. The International Journal of High Performance Computing Applications, 38(3):245-259. DOI: 10.1177/10943420241231928. Online publication date: 5-Feb-2024
  • (2024) FPGA-Based Sparse Matrix Multiplication Accelerators: From State-of-the-Art to Future Opportunities. ACM Transactions on Reconfigurable Technology and Systems, 17(4):1-37. DOI: 10.1145/3687480. Online publication date: 28-Aug-2024
  • (2024) SaSpGEMM: Sorting-Avoiding Sparse General Matrix-Matrix Multiplication on Multi-Core Processors. Proceedings of the 53rd International Conference on Parallel Processing, 1166-1175. DOI: 10.1145/3673038.3673054. Online publication date: 12-Aug-2024
  • (2024) Compilation of Modular and General Sparse Workspaces. Proceedings of the ACM on Programming Languages, 8(PLDI):1213-1238. DOI: 10.1145/3656426. Online publication date: 20-Jun-2024
  • (2024) On Efficient Large Sparse Matrix Chain Multiplication. Proceedings of the ACM on Management of Data, 2(3):1-27. DOI: 10.1145/3654959. Online publication date: 30-May-2024
  • (2024) SpaHet: A Software/Hardware Co-design for Accelerating Heterogeneous-Sparsity based Sparse Matrix Multiplication. Proceedings of the 61st ACM/IEEE Design Automation Conference, 1-6. DOI: 10.1145/3649329.3655944. Online publication date: 23-Jun-2024
  • (2024) STile: Searching Hybrid Sparse Formats for Sparse Deep Learning Operators Automatically. Proceedings of the ACM on Management of Data, 2(1):1-26. DOI: 10.1145/3639323. Online publication date: 26-Mar-2024
  • (2024) FSpGEMM: A Framework for Accelerating Sparse General Matrix-Matrix Multiplication Using Gustavson's Algorithm on FPGAs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 32(4):633-644. DOI: 10.1109/TVLSI.2024.3355499. Online publication date: Apr-2024
  • (2024) Mille-feuille: A Tile-Grained Mixed Precision Single-Kernel Conjugate Gradient Solver on GPUs. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 1-16. DOI: 10.1109/SC41406.2024.00064. Online publication date: 17-Nov-2024
  • (2024) AmgT: Algebraic Multigrid Solver on Tensor Cores. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 1-16. DOI: 10.1109/SC41406.2024.00058. Online publication date: 17-Nov-2024
